NLP Flashcards

1
Q

Corpus

A

A corpus refers to a large and structured collection of text documents. These documents can be any type of written or spoken language material, such as articles, books, conversations, emails, or any other form of textual data. Corpora (plural of corpus) serve as essential resources for training and evaluating NLP models.
Corpus > Documents > Paragraphs > Sentences > Tokens

2
Q

NLP

A

NLP stands for Natural Language Processing, which is a field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and contextually appropriate.

3
Q

NLP applications

A

Text Classification
Machine Translation
Virtual Assistants & chatbots
Sentiment analysis
Spam Detection
Speech Recognition
Text Summarization
Question Answering Systems

4
Q

Tokenization

A

Tokenization is the process of breaking down a text into smaller units, known as tokens. In the context of Natural Language Processing (NLP), these tokens can be words, subwords, or even characters, depending on the level of granularity needed for a particular task. Tokenization is a crucial preprocessing step in NLP that helps in organizing and analyzing textual data.
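For example, a minimal word-level tokenizer can be sketched in Python with a regular expression (the sample sentence is made up for illustration); production pipelines typically rely on a library tokenizer such as NLTK or spaCy:

# A minimal sketch of word-level tokenization using a regular expression.
import re

text = "NLP breaks text into tokens. Tokens can be words or subwords!"
tokens = re.findall(r"\w+|[^\w\s]", text)   # words plus standalone punctuation
print(tokens)
# ['NLP', 'breaks', 'text', 'into', 'tokens', '.', 'Tokens', 'can', ...]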

5
Q

Normalization

A

Normalization refers to the process of making text more consistent and uniform. It involves transforming text data to a standard format, reducing variations and making it easier to analyze. Normalization helps in handling different forms of words or expressions to treat them as the same.
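A minimal sketch of common normalization steps (lowercasing, punctuation removal, whitespace collapsing) in plain Python; the example string is illustrative only:

import re

def normalize(text):
    text = text.lower()                       # case folding
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize("Hello, WORLD!!  It's NLP."))  # "hello world it s nlp"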

6
Q

Stemming

A

Stemming is a natural language processing (NLP) technique used to reduce words to their base or root form, known as the stem. The goal of stemming is to simplify words to a common base form, even if they have different suffixes or prefixes. This process helps in standardizing words and reducing the dimensionality of the vocabulary, making it easier to analyze and process textual data.
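A small sketch using NLTK's Porter stemmer (assuming NLTK is installed); note that the resulting stems need not be valid words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "easily", "studies"]:
    print(word, "->", stemmer.stem(word))
# running -> run, runs -> run, easily -> easili, studies -> studi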

7
Q

Lemmatization

A

Lemmatization aims to transform words to their base or dictionary form, known as the lemma.
The lemma represents the canonical or base form of a word, and lemmatization takes into account the word’s morphological analysis, considering factors such as tense, gender, and number. Unlike stemming, lemmatization ensures that the resulting lemma is a valid word.
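A small sketch using NLTK's WordNet lemmatizer (assuming NLTK and the wordnet corpus are available); unlike the stems above, the outputs are valid dictionary words:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies", pos="n"))  # study
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("running", pos="v"))  # run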

8
Q

Parsing

A

Parsing, in the context of Natural Language Processing (NLP), refers to the process of analyzing the grammatical structure of a sentence to understand its syntactic components and their relationships. The goal of parsing is to create a hierarchical structure that represents the grammatical relationships between words in a sentence. This structure is often represented as a parse tree or a syntactic tree.
The parsing process involves identifying the parts of speech of words, grouping them into phrases, and determining the syntactic relationships between these phrases. The result is a hierarchical representation that reflects the syntactic structure of the sentence.
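A dependency-parsing sketch with spaCy, assuming the small English model en_core_web_sm is installed; the sentence and printed relations are illustrative:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")
for token in doc:
    print(token.text, token.pos_, token.dep_, "<-", token.head.text)
# e.g. "cat NOUN nsubj <- sat", "on ADP prep <- sat", ...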

9
Q

Morphological analyzer

A

A morphological analyzer is a linguistic tool or system designed to analyze the morphemes of words in a language. Morphemes are the smallest units of meaning in a language and can be classified into two main types: stems (root forms) and affixes (prefixes, suffixes, infixes, etc.).
The primary goal of a morphological analyzer is to break down words into their constituent morphemes and provide information about their grammatical properties, such as tense, number, gender, case, and so on. Morphological analysis is crucial in understanding the internal structure of words, especially in languages with complex morphological systems.

10
Q

Part-of-speech (POS) tagging

A

Part-of-speech (POS) tagging, also known as grammatical tagging, is a natural language processing (NLP) task that involves assigning grammatical categories (parts of speech) to words in a text. The parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. POS tagging is a crucial step in many NLP applications and linguistic analyses because it provides information about the syntactic structure of a sentence.
Accurate POS tagging is essential for various downstream NLP tasks as the grammatical structure of a sentence significantly influences its meaning.
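A sketch using NLTK's pre-trained tagger (assuming the required NLTK resources have been downloaded); the tags follow the Penn Treebank convention:

import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ...]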

11
Q

Bag of Words (BoW)

A

The “Bag of Words” (BoW) is a popular and simple representation model used in Natural Language Processing (NLP) and information retrieval. It is a way of converting text data into numerical vectors that can be used by machine learning algorithms.
The order of words is usually ignored in a Bag of Words representation. The model captures the frequency of words in each document but discards information about word order and structure.
While the Bag of Words model is simple and efficient, it has limitations. It doesn’t capture the semantic meaning of words or their relationships. Advanced models like Word Embeddings and transformer-based models have been developed to address these limitations and provide more sophisticated representations of text data.
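A sketch using scikit-learn's CountVectorizer on two toy documents (assuming a recent scikit-learn version):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # vocabulary, alphabetically ordered
print(X.toarray())                         # per-document word counts; word order is lost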

12
Q

Term Frequency-Inverse Document Frequency

A

TF/IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is commonly employed for feature extraction and text representation in natural language processing (NLP) tasks.

13
Q

Term Frequency (TF)

A

Measures how often a term (word) occurs in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document. The idea is that words occurring frequently in a document are likely to be important.
TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)

14
Q

Inverse Document Frequency (IDF)

A

Measures the importance of a term across a collection of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The idea is to give more weight to terms that are rare across the entire corpus.
IDF(t,D) = log(|D| / Number of documents containing term t), where |D| is the total number of documents in the corpus D.

15
Q

TF/IDF Score

A

The TF/IDF score for a term in a document is the product of its TF and IDF scores.
TF/IDF(t,d,D)=TF(t,d)×IDF(t,D)
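A direct Python translation of the two formulas above on a toy tokenized corpus; note that libraries such as scikit-learn use slightly smoothed variants of IDF:

import math

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "ran"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))  # 0.0    -> appears in every document
print(tf_idf("cat", docs[0], docs))  # ~0.135 -> rarer, hence more informative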

16
Q

Named Entity Recognition (NER)

A

Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities (specific entities or objects) in text. Named entities are often proper nouns that refer to real-world entities such as persons, organizations, locations, dates, numerical values, percentages, and more. The goal of NER is to extract and categorize these entities to better understand the information present in a text.
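A NER sketch with spaCy, again assuming the en_core_web_sm model is installed; entity labels may vary slightly between model versions:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California in 1976.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG, Steve Jobs PERSON, California GPE, 1976 DATE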

17
Q

K-Nearest Neighbors (KNN)

A

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. In KNN, a data point is classified or predicted based on the majority class or the average of the k-nearest data points in the feature space.

18
Q

What is K and how to choose it?

A

The number of neighbors to consider when making predictions.
Choosing the right number for K is a process called parameter tuning and is important for better accuracy.
A common rule of thumb is K ≈ sqrt(n), where n is the total number of data points, and an odd value of K is chosen (for binary classification) to avoid ties between the two classes.

19
Q

When to use KNN?

A

When:
Data is labelled
Data is noise-free
Dataset is small

20
Q

How does the KNN algorithm work?

A

1- Compute the distance (commonly the Euclidean distance) between the unknown data point and every other data point in the dataset.
2- Select the k nearest neighbors.
3- Assign the class that the majority of those neighbors belong to, as sketched below.
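A minimal plain-Python sketch of these three steps on a made-up 2-D dataset:

import math
from collections import Counter

train = [((1, 1), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 7), "B")]

def knn_predict(x, train, k=3):
    # 1- distance to every known point
    dists = [(math.dist(x, point), label) for point, label in train]
    # 2- keep the k nearest neighbors
    nearest = sorted(dists)[:k]
    # 3- majority vote among their labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((2, 2), train))  # "A"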

21
Q

KNN Advantages

A
  • Simple and intuitive algorithm.
  • No training phase; the model memorizes the training data.
  • Suitable for non-linear relationships in data.
22
Q

KNN Limitations

A
  • Computationally expensive for large datasets, as it requires calculating distances for each prediction.
  • Sensitive to irrelevant features or noise in the data.
  • The choice of the number of neighbors (k) is crucial and may impact the model’s performance.
23
Q

Naive Bayes

A

Naive Bayes is a supervised machine learning algorithm based on Bayes’ theorem, which calculates the probability of a hypothesis given the evidence. The “naive” assumption in Naive Bayes is that the features used to describe an observation are conditionally independent, given the class label. Despite its simplicity and assumptions, Naive Bayes is widely used for classification tasks.

24
Q

Bayes’ Theorem

A

P(A|B) = (P(B|A) × P(A)) / P(B)

25
Q

Types of Naive Bayes Classifiers

A

Gaussian Naive Bayes
Categorical Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes

26
Q

How does the Naive Bayes classifier work?

A

1- Calculate the prior probability of each class label.
2- Find the likelihood of each attribute value given each class.
3- Plug these values into Bayes' formula and calculate the posterior probability of each class.
4- Assign the input to the class with the highest posterior probability, as sketched below.
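A sketch of a Naive Bayes spam classifier with scikit-learn; the tiny training set is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["win a free prize now", "cheap pills offer", "meeting at noon", "see you at lunch"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize offer"])))  # likely ['spam']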

27
Q

Advantages of Naive Bayes

A
  • Simple and easy to implement.
  • Performs well with high-dimensional datasets.
  • Efficient for large datasets.
  • Works well with categorical data.
28
Q

Naive Bayes Limitations

A
  • Assumes independence between features, which may not hold in real-world scenarios.
  • Sensitivity to irrelevant features.
  • May not handle well situations where the probability estimates are required to be highly accurate.
29
Q

Linear regression

A

Linear Regression is a supervised machine learning algorithm used for predicting a continuous outcome variable (dependent variable) based on one or more predictor variables (independent variables). The relationship between the predictor variables and the outcome is assumed to be linear.
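A sketch with scikit-learn on toy data generated from y = 2x + 1:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])
y = np.array([3, 5, 7, 9])            # y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [2.] and 1.0
print(model.predict([[5]]))           # approximately [11.]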

30
Q

Linear Regression Advantages

A
  • Simple and easy to understand.
  • Efficient for linear relationships between variables.
  • Provides interpretable coefficients indicating the strength and direction of relationships.
31
Q

Linear Regression Limitations

A
  • Assumes a linear relationship between predictors and the response variable.
  • Sensitive to outliers, which can strongly influence the model.
  • May not perform well when the relationship between variables is highly non-linear.
32
Q

Logistic Regression

A

Logistic Regression is a supervised machine learning algorithm used for binary classification tasks. Despite its name, logistic regression is used for classification, not regression. It models the probability that an instance belongs to a particular class using the logistic function.
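A sketch with scikit-learn on made-up data (hours studied vs. pass/fail):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # hours studied
y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.0]]))        # predicted class (0 or 1)
print(clf.predict_proba([[2.0]]))  # probability of each class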

33
Q

Logistic Regression Advantages

A
  • Simple and efficient for binary classification tasks.
  • Provides probabilities for class membership.
  • Works well with high-dimensional datasets.
34
Q

Logistic Regression Limitations

A
  • Assumes a linear relationship between features and log-odds.
  • Sensitive to outliers.
  • Limited to binary classification (extensions like multinomial logistic regression handle multiple classes).
35
Q

Hidden Markov Model (HMM)

A

A Hidden Markov Model (HMM) is a statistical model used to describe a system that evolves over time and is characterized by unobservable (hidden) states. It is a type of Markov model where the state of the system is not directly observable but can be inferred from observable outputs, which are assumed to depend on the current state.

36
Q

How does the Hidden Markov Model (HMM) work?

A

1- Specify the set of possible hidden states and observations.
2- Establish the initial state distribution.
3- Define the probabilities of transitioning from one state to another.
4- Specify the probabilities of generating each observation from each state.
5- Train the model by estimating its parameters (initial, transition, and emission probabilities) from the observed data.
6- Decode the most probable sequence of hidden states based on the observed data.
7- Evaluate the performance of the HMM.

37
Q

Hidden Markov Model (HMM) Advantages

A
  • Effective for modeling systems with hidden states and observable outputs over time.
  • Widely used in speech recognition, bioinformatics, natural language processing, and other sequential data applications.
38
Q

Hidden Markov Model (HMM) Limitations

A
  • Assumes the Markov property, where the future state depends only on the current state and not on the sequence of events leading to the current state.
  • Sensitive to the quality of training data.
  • Complexity increases with the number of hidden states and observations.
39
Q

Support Vector Machine (SVM)

A

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression tasks. SVM finds a hyperplane in a high-dimensional space that separates data into classes while maximizing the margin between the classes. It is particularly effective in high-dimensional spaces and is known for its ability to handle non-linear relationships through the use of kernel functions.
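A sketch with scikit-learn's SVC using an RBF kernel on the XOR problem, which is not linearly separable; the hyperparameters are illustrative:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])               # XOR: not linearly separable

clf = SVC(kernel="rbf", gamma="scale", C=10.0).fit(X, y)
print(clf.predict(X))                    # should recover [0 1 1 0]
print(clf.support_vectors_)              # the points that define the margin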

40
Q

Support Vector Machine (SVM) Advantages

A
  • Effective in high-dimensional spaces.
  • Versatile for both linear and non-linear classification and regression tasks.
  • Robust to overfitting, especially in high-dimensional spaces.
41
Q

Support Vector Machine (SVM) Limitations

A
  • Can be computationally expensive, especially for large datasets.
  • Selection of the appropriate kernel and tuning of hyperparameters are crucial.
  • Interpretability of the model might be challenging.
42
Q

Hyperplane

A

In geometry, a hyperplane is a subspace of one dimension less than its ambient space. In the context of machine learning, particularly in the Support Vector Machine (SVM) algorithm, a hyperplane refers to a separating boundary between classes in a high-dimensional space. In a two-dimensional space, a hyperplane is essentially a line, while in a three-dimensional space, it is a plane.

43
Q

Support vectors

A

Support vectors are the data points that lie closest to the hyperplane, and they play a crucial role in defining the decision boundary. The decision boundary is determined by the hyperplane that maximizes the margin between different classes in a classification problem.

44
Q

Distance margin

A

The distance margin refers to the separation or gap between the decision boundary (hyperplane) and the nearest data point of either class. SVM aims to find the hyperplane that maximizes this distance margin between classes. This margin is crucial for the SVM’s ability to generalize well to new, unseen data.

45
Q

Kernel Function

A

A kernel function maps the input data from its original n-dimensional space into a higher-dimensional space (for example, n+1 dimensions), making it easier to place a hyperplane that separates the different classes of data.

46
Q

Neural Networks

A

A neural network is a computational model inspired by the structure and functioning of the human brain. It is composed of interconnected nodes, commonly referred to as neurons or artificial neurons, organized in layers. Neural networks are a fundamental component of the field of deep learning, a subset of machine learning.

47
Q

Input data

A

It’s a number that you recorded in the real world somewhere. It’s usually something that is easily knowable, like today’s temperature, a baseball player’s batting average, or yesterday’s stock price.

48
Q

Prediction

A

A prediction is what the neural network tells you, given the input data, such as “given the temperature, it is 0% likely that people will wear sweatsuits today” or “given a baseball player’s batting average, he is 30% likely to hit a home run” or “given yesterday’s stock price, today’s stock price will be 101.52.”

49
Q

How does the neural network learn?

A

Trial and error! First, it tries to make a prediction. Then, it sees whether the prediction was too high or too low. Finally, it changes the weight (up or down) to predict more accurately the next time it sees the same input.

50
Q

Neurons

A

Neurons are the basic building blocks of a neural network. Each neuron receives input, processes it through an activation function, and produces an output.

51
Q

Weights

A

Weights refer to the parameters that the network learns during the training process. Each connection between neurons in adjacent layers is associated with a weight. These weights determine the strength of the connections and play a crucial role in the computation performed by each neuron.

52
Q

Activation Function

A

Each neuron has an activation function that introduces non-linearity to the model.
Common activation functions include:
- The sigmoid: This function transforms the range of combined inputs to a range between 0 and 1.
- Hyperbolic tangent (tanh): This function transforms the range of combined inputs to a range between -1 and 1.
- Rectified linear unit (ReLU): This function passes positive inputs through unchanged and outputs 0 for negative inputs, which gets rid of the negative values (see the sketch below).
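A minimal NumPy sketch of the three activation functions listed above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0, x)           # zeroes out negative values

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")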

53
Q

Feedforward and Backpropagation

A

In the training process, neural networks use a feedforward pass to make predictions, and then they use backpropagation to update the weights based on the difference between predicted and actual outcomes. This iterative process helps the network improve its performance.

54
Q

Biases

A

Biases are additional parameters that shift the activation function to the left or right, which adds to the network’s flexibility.

55
Q

Loss function

A

A loss function, also known as a cost function or objective function, is a mathematical measure that quantifies the difference between the predicted values of a model and the actual values (ground truth) in a supervised learning task. The goal of training a machine learning model is to minimize this loss function, as it represents how well the model is performing on the given task.

56
Q

Gradient Descent

A

Gradient descent is an algorithm used to find the optimal weights and biases that minimize the loss function. It iteratively adjusts the weights and biases in the direction that reduces the error most rapidly.
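A minimal sketch of gradient descent minimizing the one-parameter loss L(w) = (w - 3)^2, whose minimum is at w = 3; the learning rate is illustrative:

learning_rate = 0.1
w = 0.0
for step in range(100):
    grad = 2 * (w - 3)         # dL/dw
    w -= learning_rate * grad  # move against the gradient
print(round(w, 4))             # approximately 3.0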

57
Q

Convolutional Neural Network (CNN)

A

Convolutional Neural Networks (CNNs) are a class of deep neural networks designed for processing and analyzing structured grid-like data. They are particularly powerful in tasks related to computer vision, image and video analysis, and pattern recognition. CNNs have been highly successful in various applications, including image classification, object detection, and image generation.

58
Q

Convolutional Layers

A

CNNs are built on the concept of convolutional layers. Convolutional operations involve sliding a set of learnable filters (kernels) across the input data to extract features.

59
Q

Pooling Layers

A

Pooling layers are commonly used in CNNs to downsample the spatial dimensions of the input, reducing the computational complexity and focusing on the most important features. Max pooling and average pooling are common pooling operations.
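A sketch of a convolutional layer followed by max pooling using PyTorch (assuming torch is installed); the shapes are illustrative:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)             # batch of one 32x32 RGB image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

features = pool(conv(x))
print(features.shape)                     # torch.Size([1, 16, 16, 16])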

60
Q

Word Embeddings

A

Word embeddings are vector representations of words in a continuous vector space, where words with similar meanings are located close to each other. These representations are learned through unsupervised learning techniques, capturing semantic relationships between words based on their context in a given corpus. Word embeddings have become a fundamental component in natural language processing (NLP) and machine learning tasks involving textual data.

61
Q

Word2Vec

A

Word2Vec is a popular word embedding technique that learns vector representations by predicting the context words (skip-gram) given a target word or predicting the target word given its context words (continuous bag of words, CBOW).
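A sketch with gensim's Word2Vec (gensim 4.x API, skip-gram via sg=1); the two-sentence corpus is far too small to learn meaningful vectors and is for illustration only:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["cat"].shape)          # (50,)
print(model.wv.most_similar("cat"))   # nearest words in the embedding space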

62
Q

Word Embedding Advantages

A
  • Semantic Similarity: Word embeddings capture semantic similarity, enabling words with similar meanings to have vectors close to each other.
  • Contextual Information: They incorporate contextual information, allowing the model to understand the meaning of a word in a given context.
  • Transfer Learning: Pre-trained word embeddings can be used as features in downstream NLP tasks, providing a boost in performance, especially with limited labeled data.
63
Q

Word Embedding Limitations

A
  • Out-of-Vocabulary Words: It may struggle with out-of-vocabulary words that were not present in the training data.
  • Fixed Dimensionality: Word embeddings have a fixed dimensionality, which may limit their ability to capture highly complex relationships.
  • Lack of Interpretability: The resulting vectors may not provide clear interpretability for human understanding.
64
Q

Recurrent Neural Network (RNN)

A

Recurrent Neural Networks (RNNs) are a type of neural network architecture designed to handle sequential data by maintaining internal memory or hidden states. Unlike feedforward neural networks, RNNs have connections that form directed cycles, allowing them to capture temporal dependencies in sequences. RNNs are widely used in natural language processing (NLP), speech recognition, time series analysis, and other tasks involving sequential data.

65
Q

Sequential Processing

A

RNNs are designed to process sequences of data one element at a time, maintaining hidden states that capture information from previous elements in the sequence.

66
Q

Hidden States

A

Each RNN unit (or cell) has an internal hidden state, which serves as a memory that retains information from previous time steps. The hidden state is updated at each time step based on the current input and the previous hidden state.

67
Q

Vanishing and Exploding Gradients

A

Training deep RNNs can suffer from the vanishing or exploding gradients problem, where gradients become too small or too large during backpropagation. This can make learning long-term dependencies challenging.

68
Q

LSTM (Long Short-Term Memory)

A

LSTM, or Long Short-Term Memory, is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem in traditional RNNs. It is particularly well-suited for tasks involving sequences of data, such as time series prediction, natural language processing, and speech recognition.
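A sketch of a single LSTM layer in PyTorch processing a batch of random sequences; the dimensions are illustrative:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)                  # 4 sequences, 10 time steps, 8 features
output, (h_n, c_n) = lstm(x)
print(output.shape)                        # torch.Size([4, 10, 16]) - hidden state at every step
print(h_n.shape)                           # torch.Size([1, 4, 16])  - final hidden state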

69
Q

RNN limitations

A

RNNs may struggle with capturing very long-term dependencies, and training deep RNNs can be computationally expensive. More advanced architectures like attention mechanisms and transformer networks have been proposed to address these challenges.

70
Q

Transformers

A

Transformers are a type of neural network architecture introduced for natural language processing (NLP) tasks. They utilize self-attention mechanisms to capture long-range dependencies in sequential data efficiently. Transformers have become a cornerstone in various NLP applications, including machine translation, text summarization, and language understanding.

71
Q

Positional Encoding

A

As transformers do not inherently capture the sequential order of data, positional encodings are added to the input embeddings to provide information about the positions of words in the sequence.

72
Q

Attention

A

In the context of neural networks, attention refers to a mechanism that allows the model to focus on different parts of the input when making predictions or generating output. Attention mechanisms have been particularly impactful in natural language processing (NLP) and computer vision tasks, enabling models to selectively attend to relevant information and improve their performance on complex tasks.

73
Q

Self-Attention Mechanism

A

The core innovation in transformers is the self-attention mechanism. It allows each word in the input sequence to focus on different parts of the sequence, capturing contextual information effectively.
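A NumPy sketch of scaled dot-product self-attention, softmax(QK^T / sqrt(d)) V, with random toy matrices standing in for learned projections:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 5, 8
X = np.random.randn(seq_len, d)            # token representations
Wq, Wk, Wv = [np.random.randn(d, d) for _ in range(3)]

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)              # how much each token attends to every other token
attention = softmax(scores) @ V
print(attention.shape)                     # (5, 8): one context-aware vector per token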

74
Q

Multi-Head Attention

A

Multi-Head Attention is an extension of the traditional attention mechanism in neural networks, commonly used in models like transformers. It involves using multiple attention heads in parallel to capture different aspects of relationships and dependencies in the input data. The outputs from these multiple heads are typically concatenated or linearly transformed to produce the final attention output.

75
Q

Transformers Advantages

A
  • Efficient Parallelization: Transformers allow for parallel processing of sequences, making them highly efficient and scalable.
  • Capturing Long-Range Dependencies: The self-attention mechanism enables transformers to capture dependencies across long distances in sequences.
  • Transfer Learning: Pre-trained transformer models, such as BERT and GPT, have demonstrated excellent performance in a variety of downstream NLP tasks.
76
Q

Transformers Limitations

A
  • Computational Intensity: Training large transformer models can be computationally intensive.
  • Data Requirements: Transformers often require large amounts of data for effective training.
  • Interpretability: The self-attention mechanism’s complexity may make transformers less interpretable compared to simpler models.
77
Q

BERT

A

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing (NLP) model introduced by Google. It belongs to the transformer architecture family and is designed to capture bidirectional context and relationships in language. BERT has achieved state-of-the-art results in various NLP tasks through unsupervised pre-training on large corpora and fine-tuning on specific tasks.
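A sketch of loading pre-trained BERT with the Hugging Face transformers library to obtain contextual embeddings (assuming transformers and PyTorch are installed):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("NLP flashcards are useful.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)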

78
Q

Stochastic Gradient Descent (SGD)

A

Stochastic Gradient Descent is a variant of gradient descent that iteratively adjusts a model’s parameters based on the error observed on a single training example (or a small mini-batch) at a time, rather than on the full dataset. Each step is a small, noisy move towards improving the model’s performance.

79
Q

Information Retrieval (IR)

A

Information Retrieval (IR) is the process of retrieving relevant information from a large collection of unstructured data based on a user’s query. The goal is to match user queries with documents in a database or corpus and present the most relevant results. IR systems are commonly used in search engines, document retrieval, and question-answering systems.

80
Q

Information Extraction (IE)

A

Information Extraction (IE) is the process of automatically extracting structured information from unstructured text. It involves identifying and extracting specific pieces of information, such as entities, relationships, and events, from large volumes of text data. Information Extraction aims to convert unstructured data into a more structured and usable form.