NLP Flashcards
Corpus
A corpus refers to a large, structured collection of text documents. These documents can be any type of written or spoken language material, such as articles, books, conversations, emails, or any other form of textual data. Corpora (plural of corpus) serve as essential resources for training and evaluating NLP models.
Corpus > Documents > Paragraphs > Sentences > Tokens
NLP
NLP stands for Natural Language Processing, which is a field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and contextually appropriate.
NLP applications
Text Classification
Machine Translation
Virtual Assistants & chatbots
Sentiment analysis
Spam Detection
Speech Recognition
Text Summarization
Question Answering Systems
Tokenization
Tokenization is the process of breaking down a text into smaller units, known as tokens. In the context of Natural Language Processing (NLP), these tokens can be words, subwords, or even characters, depending on the level of granularity needed for a particular task. Tokenization is a crucial preprocessing step in NLP that helps in organizing and analyzing textual data.
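As a sketch, word-level tokenization can be approximated with a regular expression (real tokenizers handle contractions, punctuation, and URLs much more carefully):

```python
import re

def tokenize(text):
    # Extract runs of word characters; a naive sketch, not a production tokenizer
    return re.findall(r"\w+", text)

tokens = tokenize("NLP breaks text into tokens!")
print(tokens)  # ['NLP', 'breaks', 'text', 'into', 'tokens']
```

Note that punctuation is simply dropped here; subword tokenizers used by modern models (e.g. BPE) split at a finer granularity.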
Normalization
Normalization refers to the process of making text more consistent and uniform. It involves transforming text data to a standard format, reducing variations and making it easier to analyze. Normalization helps in handling different forms of words or expressions to treat them as the same.
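A minimal sketch of common normalization steps (lowercasing, accent stripping, whitespace collapsing) using only the standard library:

```python
import re
import unicodedata

def normalize(text):
    # Lowercase, strip accents, and collapse whitespace; a minimal sketch
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)           # decompose accented chars
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Café   CRÈME "))  # 'cafe creme'
```

After this step, "Café" and "cafe" are treated as the same token, which is exactly the uniformity normalization aims for.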
Stemming
Stemming is a natural language processing (NLP) technique used to reduce words to their base or root form, known as the stem. The goal of stemming is to simplify words to a common base form, even if they have different suffixes or prefixes. This process helps in standardizing words and reducing the dimensionality of the vocabulary, making it easier to analyze and process textual data.
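Real stemmers such as the Porter stemmer apply ordered rule sets with conditions on the remaining stem; the toy suffix-stripper below only illustrates the idea:

```python
def naive_stem(word):
    # A toy suffix-stripping stemmer; note the output (e.g. 'runn')
    # need not be a valid dictionary word
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["running", "jumped", "cats"]])
# ['runn', 'jump', 'cat']
```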
Lemmatization
Lemmatization aims to transform words to their base or dictionary form, known as the lemma.
The lemma represents the canonical or base form of a word, and lemmatization takes into account the word’s morphological analysis, considering factors such as tense, gender, and number. Unlike stemming, lemmatization ensures that the resulting lemma is a valid word.
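In contrast to stemming, a lemmatizer always returns a valid dictionary form. A minimal lookup-based sketch (the tiny lexicon below is a hypothetical illustration; real lemmatizers combine a full vocabulary with morphological rules):

```python
# Hypothetical toy lexicon mapping inflected forms to their lemmas
LEMMAS = {"ran": "run", "running": "run", "better": "good", "mice": "mouse"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the lexicon
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["ran", "better", "mice", "cat"]])
# ['run', 'good', 'mouse', 'cat']
```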
Parsing
Parsing, in the context of Natural Language Processing (NLP), refers to the process of analyzing the grammatical structure of a sentence to understand its syntactic components and their relationships. The goal of parsing is to create a hierarchical structure that represents the grammatical relationships between words in a sentence. This structure is often represented as a parse tree or a syntactic tree.
The parsing process involves identifying the parts of speech of words, grouping them into phrases, and determining the syntactic relationships between these phrases. The result is a hierarchical representation that reflects the syntactic structure of the sentence.
Morphological analyzer
A morphological analyzer is a linguistic tool or system designed to analyze the morphemes of words in a language. Morphemes are the smallest units of meaning in a language and can be classified into two main types: stems (root forms) and affixes (prefixes, suffixes, infixes, etc.).
The primary goal of a morphological analyzer is to break down words into their constituent morphemes and provide information about their grammatical properties, such as tense, number, gender, case, and so on. Morphological analysis is crucial in understanding the internal structure of words, especially in languages with complex morphological systems.
Part-of-speech (POS) tagging
Part-of-speech (POS) tagging, also known as grammatical tagging, is a natural language processing (NLP) task that involves assigning grammatical categories (parts of speech) to words in a text. The parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. POS tagging is a crucial step in many NLP applications and linguistic analyses because it provides information about the syntactic structure of a sentence.
Accurate POS tagging is essential for various downstream NLP tasks as the grammatical structure of a sentence significantly influences its meaning.
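A toy lookup-based tagger sketch (the tag lexicon is hypothetical; real taggers use surrounding context via statistical or neural models, since many words are ambiguous):

```python
# Hypothetical toy lexicon; real taggers disambiguate using context
TAGS = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}

def pos_tag(tokens):
    # Default unknown words to NOUN, a common fallback heuristic
    return [(t, TAGS.get(t, "NOUN")) for t in tokens]

print(pos_tag("the cat sat on the mat".split()))
```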
Bag of Words (BoW)
The “Bag of Words” (BoW) is a popular and simple representation model used in Natural Language Processing (NLP) and information retrieval. It is a way of converting text data into numerical vectors that can be used by machine learning algorithms.
The order of words is usually ignored in a Bag of Words representation. The model captures the frequency of words in each document but discards information about word order and structure.
While the Bag of Words model is simple and efficient, it has limitations. It doesn’t capture the semantic meaning of words or their relationships. Advanced models like Word Embeddings and transformer-based models have been developed to address these limitations and provide more sophisticated representations of text data.
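The model can be sketched in a few lines: build a shared vocabulary, then count each word's frequency per document, discarding order:

```python
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# Shared vocabulary across all documents, sorted for a stable column order
vocab = sorted({w for d in docs for w in d.split()})

# One frequency vector per document; word order is discarded, as BoW prescribes
vectors = []
for d in docs:
    counts = Counter(d.split())
    vectors.append([counts[w] for w in vocab])

print(vocab)    # ['ate', 'cat', 'fish', 'sat', 'the']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```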
Term Frequency-Inverse Document Frequency
TF/IDF, which stands for Term Frequency-Inverse Document Frequency, is a numerical statistic used in information retrieval and text mining to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is commonly employed for feature extraction and text representation in natural language processing (NLP) tasks.
Term Frequency (TF)
Measures how often a term (word) occurs in a document. It is calculated as the ratio of the number of times a term appears in a document to the total number of terms in that document. The idea is that words occurring frequently in a document are likely to be important.
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
Inverse Document Frequency (IDF)
Measures the importance of a term across a collection of documents. It is calculated as the logarithm of the total number of documents divided by the number of documents containing the term. The idea is to give more weight to terms that are rare across the entire corpus.
IDF(t, D) = log(|D| / number of documents containing term t), where |D| is the total number of documents in the corpus
TF/IDF Score
The TF/IDF score for a term in a document is the product of its TF and IDF scores.
TF/IDF(t, d, D) = TF(t, d) × IDF(t, D)
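The three formulas above translate directly into code (a minimal sketch; real implementations typically smooth the IDF term so that terms absent from every document do not cause a division by zero):

```python
import math
from collections import Counter

def tf(term, doc):
    # Fraction of the document's terms that are `term`
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Log of (total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]
print(tf_idf("cat", docs[0], docs))  # nonzero: "cat" is rare in the corpus
print(tf_idf("the", docs[0], docs))  # 0.0: "the" appears in every document
```

The second result illustrates the key property: terms appearing in every document get an IDF of log(1) = 0, so they contribute nothing regardless of their frequency.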
Named Entity Recognition (NER)
Named Entity Recognition (NER) is a natural language processing (NLP) task that involves identifying and classifying named entities (specific entities or objects) in text. Named entities are often proper nouns that refer to real-world entities such as persons, organizations, locations, dates, numerical values, percentages, and more. The goal of NER is to extract and categorize these entities to better understand the information present in a text.
K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression tasks. In KNN, a data point is classified or predicted based on the majority class or the average of the k-nearest data points in the feature space.
What is K and how to choose it?
The number of neighbors to consider when making predictions.
Choosing the right value of K is a form of hyperparameter tuning and is important for accuracy.
A common heuristic is K = sqrt(n), where n is the total number of data points; in binary classification an odd value of K is preferred to avoid ties between the two classes.
When to use KNN?
When:
Data is labelled
Data is noise-free
Dataset is small
How does the KNN algorithm work?
1- Calculate the distance (commonly Euclidean) between the unknown data point and every point in the dataset.
2- Select the k points with the smallest distances as the nearest neighbors.
3- Assign the unknown point to the class held by the majority of those neighbors.
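The steps above can be sketched with the standard library alone:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs
    # Step 1: sort all points by Euclidean distance to the query
    ranked = sorted(train, key=lambda p: math.dist(p[0], query))
    # Step 2: take the k nearest neighbors' labels
    labels = [label for _, label in ranked[:k]]
    # Step 3: majority vote
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5), k=3))  # 'a'
```

Sorting every point for each query is what makes KNN expensive at prediction time, as noted under the limitations below; spatial indexes (e.g. k-d trees) mitigate this.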
KNN Advantages
- Simple and intuitive algorithm.
- No training phase; the model memorizes the training data.
- Suitable for non-linear relationships in data.
KNN Limitations
- Computationally expensive for large datasets, as it requires calculating distances for each prediction.
- Sensitive to irrelevant features or noise in the data.
- The choice of the number of neighbors (k) is crucial and may impact the model’s performance.
Naive Bayes
Naive Bayes is a supervised machine learning algorithm based on Bayes’ theorem, which calculates the probability of a hypothesis given the evidence. The “naive” assumption in Naive Bayes is that the features used to describe an observation are conditionally independent, given the class label. Despite its simplicity and assumptions, Naive Bayes is widely used for classification tasks.
Bayes’ Theorem
P(A|B) = [P(B|A) × P(A)] / P(B)
Types of Naive Bayes Classifiers
Gaussian Naive Bayes
Categorical Naive Bayes
Multinomial Naive Bayes
Bernoulli Naive Bayes
How does the Naive Bayes classifier work?
1- Calculate the prior probability of each class label.
2- Compute the likelihood of each attribute value given each class.
3- Plug these values into Bayes' formula to calculate the posterior probability for each class.
4- Assign the input to the class with the highest posterior probability.
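The four steps above can be sketched for word-count data on a toy spam example (Laplace smoothing is added so unseen words do not zero out the whole product):

```python
from collections import Counter, defaultdict

# Toy training set: each document is a list of words with a class label
data = [(["win", "money", "now"], "spam"),
        (["meeting", "at", "noon"], "ham"),
        (["win", "prize", "money"], "spam"),
        (["lunch", "at", "noon"], "ham")]

priors = Counter(label for _, label in data)        # step 1: class counts
word_counts = defaultdict(Counter)
for words, label in data:
    word_counts[label].update(words)                # step 2: likelihood counts
vocab = {w for counter in word_counts.values() for w in counter}

def predict(words):
    best, best_score = None, float("-inf")
    for c, n in priors.items():
        total = sum(word_counts[c].values())
        score = n / sum(priors.values())            # prior P(c)
        for w in words:                             # step 3: multiply likelihoods
            score *= (word_counts[c][w] + 1) / (total + len(vocab))  # Laplace
        if score > best_score:                      # step 4: max posterior
            best, best_score = c, score
    return best

print(predict(["win", "money"]))     # 'spam'
print(predict(["meeting", "noon"]))  # 'ham'
```

In practice log-probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow on long documents.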
Advantages of Naive Bayes
- Simple and easy to implement.
- Performs well with high-dimensional datasets.
- Efficient for large datasets.
- Works well with categorical data.
Naive Bayes Limitations
- Assumes independence between features, which may not hold in real-world scenarios.
- Sensitivity to irrelevant features.
- Its probability estimates are often poorly calibrated, so it is a weak choice when accurate probabilities (rather than just class labels) are required.
Linear regression
Linear Regression is a supervised machine learning algorithm used for predicting a continuous outcome variable (dependent variable) based on one or more predictor variables (independent variables). The relationship between the predictor variables and the outcome is assumed to be linear.
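For a single predictor, the ordinary least-squares fit has a closed form and can be sketched directly:

```python
def fit_line(xs, ys):
    # Ordinary least squares for one predictor: y = slope * x + intercept
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # covariance term
    den = sum((x - mx) ** 2 for x in xs)                    # variance term
    slope = num / den
    return slope, my - slope * mx

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, intercept)  # 2.0 0.0 (the data lie exactly on y = 2x)
```

The slope is the interpretable coefficient mentioned below: here, each unit increase in x adds 2 to the predicted y.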
Linear Regression Advantages
- Simple and easy to understand.
- Efficient for linear relationships between variables.
- Provides interpretable coefficients indicating the strength and direction of relationships.
Linear Regression Limitations
- Assumes a linear relationship between predictors and the response variable.
- Sensitive to outliers, which can strongly influence the model.
- May not perform well when the relationship between variables is highly non-linear.
Logistic Regression
Logistic Regression is a supervised machine learning algorithm used for binary classification tasks. Despite its name, logistic regression is used for classification, not regression. It models the probability that an instance belongs to a particular class using the logistic function.
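A minimal sketch of the scoring side (the weights below are hypothetical; in practice they are learned from data, e.g. by gradient descent):

```python
import math

def sigmoid(z):
    # The logistic function maps any real score into (0, 1)
    return 1 / (1 + math.exp(-z))

def predict_proba(x, weights, bias):
    # Linear score passed through the logistic function
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# Hypothetical weights for illustration only
p = predict_proba([2.0, 1.0], weights=[1.5, -0.5], bias=-1.0)
print(round(p, 3))  # ≈ 0.818: probability of the positive class
```

Thresholding this probability (commonly at 0.5) turns the model's output into a binary class decision.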