Define Natural Language Processing.
Natural Language refers to the way that people communicate with each other using voice, text, etc.
Natural Language Processing is the interaction between computers and human language; an automatic manipulation of natural language.
Define Regular Expressions.
Regular Expressions = specially encoded text strings that are used as patterns for matching sets of strings
Discuss some regex special characters.
Anchors: ^ (start of the line) and $ (end of the line)
Negation in disjunction: [^Ss] means "neither a capital S nor a lowercase s"
etc.
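A small illustration of the anchors and the negated disjunction, using Python's standard re module. The word list is made up for the example.

```python
import re

# Hypothetical sample words, chosen only to exercise the patterns.
words = ["Sam", "sam", "Tom", "toms"]

# ^ anchors the match at the start of the string, $ at the end.
starts_with_s = [w for w in words if re.search(r"^[Ss]", w)]
ends_with_s = [w for w in words if re.search(r"s$", w)]

# Inside brackets, a leading ^ negates the set:
# [^Ss] matches any single character that is neither "S" nor "s".
first_not_s = [w for w in words if re.search(r"^[^Ss]", w)]

print(starts_with_s)  # ['Sam', 'sam']
print(ends_with_s)    # ['toms']
print(first_not_s)    # ['Tom', 'toms']
```

Note that ^ only means negation when it is the first character inside the brackets; outside brackets it is the start anchor.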
What is text normalization (in general) and what are the three tasks it is composed of?
= a set of tasks that convert a text into a more convenient, standard form
The three tasks: segmenting (tokenizing) words, normalizing word formats, and segmenting sentences
What are lemma, stem and wordform?
e.g., cat and cats belong to the same lemma, but are different wordforms
e.g., for "produced", the lemma is "produce", but the stem is "produc-".
What are types and tokens in a sentence?
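Tokens = the running words in the text; types = the distinct words.
e.g., they lay back on the San Francisco grass and looked at the stars and their
Splitting on whitespace, this has 15 tokens and 13 types ("the" and "and" each occur twice).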
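The counts for the example sentence can be checked with a few lines of Python. This is a minimal sketch that assumes plain whitespace tokenization; treating "San Francisco" as a single token, or lowercasing first, would give different numbers.

```python
sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

# Tokens: every running word, split on whitespace (an assumption).
tokens = sentence.split()

# Types: the distinct words among those tokens.
types = set(tokens)

print(len(tokens))  # 15
print(len(types))   # 13
```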
Explain the Maximum Matching Word Segmentation Algorithm.
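Given a wordlist and a string: start a pointer at the beginning of the string, find the longest word in the wordlist that matches the string starting at the pointer, move the pointer past that word, and repeat.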
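A minimal sketch of greedy maximum matching as it is usually described. The function name max_match, the tiny wordlists, and the fallback of emitting a single character when nothing matches are assumptions made for this illustration.

```python
def max_match(text, wordlist):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on
            # (an assumed fallback so the loop always advances).
            words.append(text[i])
            i += 1
    return words


# Hypothetical wordlists for the example.
print(max_match("thecatinthehat", {"the", "cat", "in", "hat"}))
# ['the', 'cat', 'in', 'the', 'hat']

print(max_match("thetabledownthere",
                {"the", "theta", "table", "bled", "down", "own", "there"}))
# ['theta', 'bled', 'own', 'there']
```

The second call shows the weakness of the greedy choice: taking "theta" first rules out the intended "the table down there".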
Normalizing word formats is part of general text normalization. Give an example of normalizing word format.
e.g., to match U.S.A. and USA, we could delete the periods; for US vs. us, we could transform everything to lowercase, but that is problematic because the two have different meanings
or, we could use asymmetric expansion: Enter: window; Search: window, windows, Windows etc.
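The three options above can be sketched in a few lines. The helper names strip_periods and expand_query and the EXPANSIONS table are hypothetical; a real system would build its expansion table from data rather than by hand.

```python
def strip_periods(token):
    """Delete periods so that 'U.S.A.' and 'USA' become the same string."""
    return token.replace(".", "")


# Hypothetical expansion table: only the lowercase query is expanded,
# which is what makes the expansion asymmetric.
EXPANSIONS = {"window": ["window", "windows", "Windows"]}


def expand_query(term):
    """Return the list of terms to search for when the user enters `term`."""
    return EXPANSIONS.get(term, [term])


print(strip_periods("U.S.A."))   # USA
print("US".lower())              # us (now indistinguishable from the pronoun)
print(expand_query("window"))    # ['window', 'windows', 'Windows']
print(expand_query("Windows"))   # ['Windows']
```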
A dot is ambiguous in NLP because it can mark the end of a sentence or it can appear in abbreviations, numbers, and so on. What can you use to determine whether a dot is the end of the sentence?
A decision tree.
What does Minimum Edit Distance do?
It tells how similar two strings are, based on the minimum number of edits (insertion, deletion, substitution) needed to turn one into the other
Used for spell correction, machine translation etc.
Insertion, deletion and substitution each have a cost of one. The Minimum Edit Distance is the smallest total cost over all sequences of these operations that transform one string into the other
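A minimal dynamic-programming sketch of the distance with unit costs, as stated above. The function name min_edit_distance and the test words are chosen for illustration.

```python
def min_edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    m, n = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[m][n]


print(min_edit_distance("kitten", "sitting"))       # 3
print(min_edit_distance("intention", "execution"))  # 5
```

Filling the table visits every cell once, so this step takes O(mn) time and space.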
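Once the minimum edit distance table between two strings has been computed, we recover the alignment (which edits were made) with a method called backtrace. What is the complexity of this?
O(m+n), linear. (Filling the table itself takes O(mn) time and space.)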
m - # characters in word1
n - # characters in word2
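A sketch of the backtrace under unit costs. Instead of storing back-pointers while filling the table, this version re-derives each step from the neighbouring cells, which is an implementation choice made here for brevity; the function name align and the operation labels are assumptions.

```python
def align(source, target):
    """Return (distance, operations) using a DP table plus a backtrace."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + sub)

    # Backtrace: walk from the bottom-right cell to the top-left cell.
    # Every step decreases i, j, or both, so there are at most m + n steps.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                ops.append("match" if sub == 0 else "substitute")
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()
    return dist[m][n], ops


distance, ops = align("kitten", "sitting")
print(distance)  # 3
print(ops)
```

The number of non-match operations in the returned list equals the distance, and the list never has more than m + n entries.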