NLP Flashcards
Corpus
A corpus refers to a large, structured collection of text documents. These documents can be any type of written or spoken language material, such as articles, books, conversations, emails, or any other form of textual data. Corpora (plural of corpus) serve as essential resources for training and evaluating NLP models.
Corpus > Documents > Paragraphs > Sentences > Tokens
NLP
NLP stands for Natural Language Processing, which is a field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and contextually appropriate.
NLP applications
Text Classification
Machine Translation
Virtual Assistants & chatbots
Sentiment analysis
Spam Detection
Speech Recognition
Text Summarization
Question Answering Systems
Tokenization
Tokenization is the process of breaking down a text into smaller units, known as tokens. In the context of Natural Language Processing (NLP), these tokens can be words, subwords, or even characters, depending on the level of granularity needed for a particular task. Tokenization is a crucial preprocessing step in NLP that helps in organizing and analyzing textual data.
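As a sketch, word-level tokenization can be approximated with a regular expression (real tokenizers handle contractions, punctuation, and URLs much more carefully):

```python
import re

def tokenize(text):
    # Extract runs of word characters; a naive sketch, not a production tokenizer
    return re.findall(r"\w+", text)

tokens = tokenize("NLP breaks text into tokens!")
print(tokens)  # ['NLP', 'breaks', 'text', 'into', 'tokens']
```

Note that punctuation is simply dropped here; subword tokenizers used by modern models (e.g. BPE) split at a finer granularity.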
Normalization
Normalization refers to the process of making text more consistent and uniform. It involves transforming text data to a standard format, reducing variations and making it easier to analyze. Normalization helps in handling different forms of words or expressions to treat them as the same.
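A minimal sketch of common normalization steps (lowercasing, accent stripping, whitespace collapsing) using only the standard library:

```python
import re
import unicodedata

def normalize(text):
    # Lowercase, strip accents, and collapse whitespace; a minimal sketch
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)           # decompose accented chars
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Café   CRÈME "))  # 'cafe creme'
```

After this step, "Café" and "cafe" are treated as the same token, which is exactly the uniformity normalization aims for.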
Stemming
Stemming is a natural language processing (NLP) technique used to reduce words to their base or root form, known as the stem. The goal of stemming is to simplify words to a common base form, even if they have different suffixes or prefixes. This process helps in standardizing words and reducing the dimensionality of the vocabulary, making it easier to analyze and process textual data.
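Real stemmers such as the Porter stemmer apply ordered rule sets with conditions on the remaining stem; the toy suffix-stripper below only illustrates the idea:

```python
def naive_stem(word):
    # A toy suffix-stripping stemmer; note the output (e.g. 'runn')
    # need not be a valid dictionary word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "jumped", "cats"]])
# ['runn', 'jump', 'cat']
```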
Lemmatization
Lemmatization aims to transform words to their base or dictionary form, known as the lemma.
The lemma represents the canonical or base form of a word, and lemmatization takes into account the word’s morphological analysis, considering factors such as tense, gender, and number. Unlike stemming, lemmatization ensures that the resulting lemma is a valid word.
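In contrast to stemming, a lemmatizer always returns a valid dictionary form. A minimal lookup-based sketch (the tiny lexicon below is a hypothetical illustration; real lemmatizers combine a full vocabulary with morphological rules):

```python
# Hypothetical toy lexicon mapping inflected forms to their lemmas
LEMMAS = {"ran": "run", "running": "run", "better": "good", "mice": "mouse"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the lexicon
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["ran", "better", "mice", "cat"]])
# ['run', 'good', 'mouse', 'cat']
```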
Parsing
Parsing, in the context of Natural Language Processing (NLP), refers to the process of analyzing the grammatical structure of a sentence to understand its syntactic components and their relationships. The goal of parsing is to create a hierarchical structure that represents the grammatical relationships between words in a sentence. This structure is often represented as a parse tree or a syntactic tree.
The parsing process involves identifying the parts of speech of words, grouping them into phrases, and determining the syntactic relationships between these phrases. The result is a hierarchical representation that reflects the syntactic structure of the sentence.
Morphological analyzer
A morphological analyzer is a linguistic tool or system designed to analyze the morphemes of words in a language. Morphemes are the smallest units of meaning in a language and can be classified into two main types: stems (root forms) and affixes (prefixes, suffixes, infixes, etc.).
The primary goal of a morphological analyzer is to break down words into their constituent morphemes and provide information about their grammatical properties, such as tense, number, gender, case, and so on. Morphological analysis is crucial in understanding the internal structure of words, especially in languages with complex morphological systems.
Part-of-speech (POS) tagging
Part-of-speech (POS) tagging, also known as grammatical tagging, is a natural language processing (NLP) task that involves assigning grammatical categories (parts of speech) to words in a text. The parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. POS tagging is a crucial step in many NLP applications and linguistic analyses because it provides information about the syntactic structure of a sentence.
Accurate POS tagging is essential for various downstream NLP tasks as the grammatical structure of a sentence significantly influences its meaning.
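A toy lookup-based tagger sketch (the tag lexicon is hypothetical; real taggers use surrounding context via statistical or neural models, since many words are ambiguous):

```python
# Hypothetical toy lexicon; real taggers disambiguate using context
TAGS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def pos_tag(tokens):
    # Default unknown words to NOUN, a common fallback heuristic
    return [(t, TAGS.get(t, "NOUN")) for t in tokens]

print(pos_tag("the cat sat on the mat".split()))
```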
Bag of Words (BoW)
The “Bag of Words” (BoW) is a popular and simple representation model used in Natural Language Processing (NLP) and information retrieval. It is a way of converting text data into numerical vectors that can be used by machine learning algorithms.
The order of words is usually ignored in a Bag of Words representation. The model captures the frequency of words in each document but discards information about word order and structure.
While the Bag of Words model is simple and efficient, it has limitations. It doesn’t capture the semantic meaning of words or their relationships. Advanced models like Word Embeddings and transformer-based models have been developed to address these limitations and provide more sophisticated representations of text data.
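The model can be sketched in a few lines: build a shared vocabulary, then count each word's frequency per document, discarding order:

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# Shared vocabulary across all documents, sorted for a stable column order
vocab = sorted({w for d in docs for w in d.split()})

# One frequency vector per document; word order is discarded, as BoW prescribes
vectors = []
for d in docs:
    counts = Counter(d.split())
    vectors.append([counts[w] for w in vocab])

print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```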
Term Frequency-Inverse Document Frequency
TF/IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is commonly employed for feature extraction and text representation in natural language processing (NLP) tasks.
Term Frequency (TF)
Measures how often a term (word) occurs in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document. The idea is that words occurring frequently in a document are likely to be important.
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Inverse Document Frequency (IDF)
Measures the importance of a term across a collection of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The idea is to give more weight to terms that are rare across the entire corpus.
IDF(t, D) = log(|D| / number of documents containing term t), where |D| is the total number of documents in the corpus
TF/IDF Score
The TF/IDF score for a term in a document is the product of its TF and IDF scores.
TF/IDF(t, d, D) = TF(t, d) × IDF(t, D)
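The three formulas above translate directly into code (a minimal sketch; real implementations typically smooth the IDF term so that terms absent from every document do not cause a division by zero):

```python
import math
from collections import Counter

def tf(term, doc):
    # Fraction of the document's terms that are `term`
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Log of (total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(tf_idf("cat", docs[0], docs))  # nonzero: "cat" is rare in the corpus
print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
```

The second result illustrates the key property: terms appearing in every document get an IDF of log(1) = 0, so they contribute nothing regardless of their frequency.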
Named Entity Recognition (NER)
Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities (specific entities or objects) in text. Named entities are often proper nouns that refer to real-world entities such as persons, organizations, locations, dates, numerical values, percentages, and more. The goal of NER is to extract and categorize these entities to better understand the information present in a text.
K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. In KNN, a data point is classified or predicted based on the majority class or the average of the k-nearest data points in the feature space.
What is K and how to choose it?
The number of neighbors to consider when making predictions.
Choosing the right value of K is a form of hyperparameter tuning and is important for accuracy.
A common heuristic is K = sqrt(n), where n is the total number of data points; in binary classification an odd value of K is preferred to avoid ties between the two classes.
When to use KNN?
When:
Data is labelled
Data is noise-free
Dataset is small
How does the KNN algorithm work?
1- Calculate the distance (commonly Euclidean) between the unknown data point and every point in the dataset.
2- Select the k points with the smallest distances as the nearest neighbors.
3- Assign the unknown point to the class held by the majority of those neighbors.
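The steps above can be sketched with the standard library alone:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs
    # Step 1: sort all points by Euclidean distance to the query
    ranked = sorted(train, key=lambda p: math.dist(p[0], query))
    # Step 2: take the k nearest neighbors' labels
    labels = [label for _, label in ranked[:k]]
    # Step 3: majority vote
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # 'a'
```

Sorting every point for each query is what makes KNN expensive at prediction time, as noted under the limitations below; spatial indexes (e.g. k-d trees) mitigate this.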
KNN Advantages
- Simple and intuitive algorithm.
- No training phase; the model memorizes the training data.
- Suitable for non-linear relationships in data.
KNN Limitations
- Computationally expensive for large datasets, as it requires calculating distances for each prediction.
- Sensitive to irrelevant features or noise in the data.
- The choice of the number of neighbors (k) is crucial and may impact the model’s performance.
Naive Bayes
Naive Bayes is a supervised machine learning algorithm based on Bayes’ theorem, which calculates the probability of a hypothesis given the evidence. The “naive” assumption in Naive Bayes is that the features used to describe an observation are conditionally independent, given the class label. Despite its simplicity and assumptions, Naive Bayes is widely used for classification tasks.
Bayes’ Theorem
P(A|B) = [P(B|A) × P(A)] / P(B)
Types of Naive Bayes Classifiers
Gaussian Naive Bayes
Categorical Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
How does the Naive Bayes classifier work?
1- Calculate the prior probability of each class label.
2- Compute the likelihood of each attribute value given each class.
3- Plug these values into Bayes' formula to calculate the posterior probability for each class.
4- Assign the input to the class with the highest posterior probability.
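The four steps above can be sketched for word-count data on a toy spam example (Laplace smoothing is added so unseen words do not zero out the whole product):

```python
from collections import Counter, defaultdict

# Toy training set: each document is a list of words with a class label
data = [(["win", "money", "now"], "spam"),
        (["meeting", "at", "noon"], "ham"),
        (["win", "prize", "money"], "spam"),
        (["lunch", "at", "noon"], "ham")]

priors = Counter(label for _, label in data)        # step 1: class counts
word_counts = defaultdict(Counter)
for words, label in data:
    word_counts[label].update(words)                # step 2: likelihood counts
vocab = {w for counter in word_counts.values() for w in counter}

def predict(words):
    best, best_score = None, float("-inf")
    for c, n in priors.items():
        total = sum(word_counts[c].values())
        score = n / sum(priors.values())            # prior P(c)
        for w in words:                             # step 3: multiply likelihoods
            score *= (word_counts[c][w] + 1) / (total + len(vocab))  # Laplace
        if score > best_score:                      # step 4: max posterior
            best, best_score = c, score
    return best

print(predict(["win", "money"]))     # 'spam'
print(predict(["meeting", "noon"]))  # 'ham'
```

In practice log-probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow on long documents.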
Advantages of Naive Bayes
- Simple and easy to implement.
- Performs well with high-dimensional datasets.
- Efficient for large datasets.
- Works well with categorical data.
Naive Bayes Limitations
- Assumes independence between features, which may not hold in real-world scenarios.
- Sensitivity to irrelevant features.
- Its probability estimates are often poorly calibrated, so it is a weak choice when accurate probabilities (rather than just class labels) are required.
Linear regression
Linear Regression is a supervised machine learning algorithm used for predicting a continuous outcome variable (dependent variable) based on one or more predictor variables (independent variables). The relationship between the predictor variables and the outcome is assumed to be linear.
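For a single predictor, the ordinary least-squares fit has a closed form and can be sketched directly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for one predictor: y = slope * x + intercept
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance term
    den = sum((x - mx) ** 2 for x in xs)                    # variance term
    slope = num / den
    return slope, my - slope * mx

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, intercept)  # 2.0 0.0 (the data lie exactly on y = 2x)
```

The slope is the interpretable coefficient mentioned below: here, each unit increase in x adds 2 to the predicted y.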
Linear Regression Advantages
- Simple and easy to understand.
- Efficient for linear relationships between variables.
- Provides interpretable coefficients indicating the strength and direction of relationships.
Linear Regression Limitations
- Assumes a linear relationship between predictors and the response variable.
- Sensitive to outliers, which can strongly influence the model.
- May not perform well when the relationship between variables is highly non-linear.
Logistic Regression
Logistic Regression is a supervised machine learning algorithm used for binary classification tasks. Despite its name, logistic regression is used for classification, not regression. It models the probability that an instance belongs to a particular class using the logistic function.
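A minimal sketch of the scoring side (the weights below are hypothetical; in practice they are learned from data, e.g. by gradient descent):

```python
import math

def sigmoid(z):
    # The logistic function maps any real score into (0, 1)
    return 1 / (1 + math.exp(-z))

def predict_proba(x, weights, bias):
    # Linear score passed through the logistic function
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical weights for illustration only
p = predict_proba([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
print(round(p, 3))  # ≈ 0.818: probability of the positive class
```

Thresholding this probability (commonly at 0.5) turns the model's output into a binary class decision.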