NLP-3 Flashcards

1
Q

Define text normalisation.

A

Text normalisation is a process that cleans up textual data so that its complexity is lower than that of the original data. It comprises the following steps:
(a) Sentence segmentation

(b) Tokenisation

(c) Removal of stop words, special characters and numbers

(d) Converting text to the same case

(e) Stemming/lemmatisation

2
Q

Define corpus.

A

A collection of textual data forms a corpus.

In text normalisation, we go through several steps to reduce the text to a simpler level. We work on text from multiple documents, and the whole textual data from all the documents taken together is known as the corpus.

3
Q

What is sentence segmentation?

A

Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as a separate piece of data, so the whole corpus is reduced to a list of sentences.
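A minimal sketch of sentence segmentation, assuming that a sentence ends at `.`, `!` or `?` followed by whitespace. The function name `segment_sentences` and the sample corpus are illustrative; this simple rule will mis-split abbreviations such as "Dr. Rao".

```python
import re

def segment_sentences(corpus):
    # Split at whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", corpus.strip())
    return [p for p in parts if p]

corpus = "Raj likes to play football. Vijay prefers online games! Do you?"
sentences = segment_sentences(corpus)
for s in sentences:
    print(s)
```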

4
Q

Define token.

A

A token is any word, number or special character occurring in a sentence.

5
Q

What happens in tokenisation?

A

After sentence segmentation, each sentence is further divided into tokens. Under tokenisation, every word, number and special character is considered separately, and each of them becomes a separate token.
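As a rough illustration, tokenisation can be sketched with a regular expression that treats each run of letters or digits as one token and each special character as its own token. The helper name `tokenize` and the sample sentence are assumptions made for this example.

```python
import re

def tokenize(sentence):
    # \w+ matches a run of letters/digits; [^\w\s] matches one special character.
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("Raj scored 2 goals!")
print(tokens)
```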

6
Q

Define stopwords.

A

Stopwords are words that occur very frequently in the corpus but add little or no value to it.

7
Q

Why are grammatical words considered stopwords?

A

Humans use grammar to make their sentences meaningful for the other person to understand. But grammatical words add little to the information that the statement is meant to convey, hence they come under stopwords.

8
Q

What happens in stopword removal?

A

Stopwords such as a, an, the, and, are occur most frequently in any given corpus but say little or nothing about its context or meaning. Hence, to make it easier for the computer to focus on meaningful terms, these words are removed.
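A small sketch of stopword removal, assuming a hand-written stopword set. The set `STOPWORDS` below is illustrative and far shorter than the lists shipped with NLP libraries.

```python
# Illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "are", "is", "to", "of", "in"}

def remove_stopwords(tokens):
    # Compare in lower case so that "The" and "the" are both dropped.
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "cat", "and", "the", "dog", "are", "in", "a", "garden"]
filtered = remove_stopwords(tokens)
print(filtered)
```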

9
Q

What else is removed along with stopwords?

A

Along with stopwords, the corpus often contains special characters and/or numbers. Whether to keep them depends on the type of corpus. For example, if you are working on a document containing email IDs, you might not want to remove the special characters and numbers, whereas in other textual data where these characters carry no meaning, you can remove them along with the stopwords.
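The choice described above can be sketched as a switch on the filtering step. The function `drop_non_words` and its `keep_special` flag are hypothetical names for this example, and the email address is made up.

```python
def drop_non_words(tokens, keep_special=False):
    # When the corpus needs them (e.g. email IDs), leave every token in place.
    if keep_special:
        return list(tokens)
    # Otherwise keep only purely alphabetic tokens.
    return [t for t in tokens if t.isalpha()]

tokens = ["mail", "me", "at", "raj99@example.com", "before", "5", "pm", "!"]
cleaned = drop_non_words(tokens)
kept = drop_non_words(tokens, keep_special=True)
print(cleaned)
print(kept)
```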

10
Q

What is done in case conversion of the corpus?

A

After stopword removal, we convert the whole text to the same case, preferably lower case. This ensures that the machine, which is case-sensitive, does not treat the same word as different words just because of a difference in case.
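A short sketch of why case conversion matters: four spellings of the same word count as four distinct tokens until they are lowered. The sample tokens are illustrative.

```python
tokens = ["Hello", "HELLO", "hello", "HeLLo"]
lowered = [t.lower() for t in tokens]

# Count distinct tokens before and after lowering.
distinct_before = len(set(tokens))
distinct_after = len(set(lowered))
print(distinct_before, distinct_after)
```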

11
Q

Explain the process of stemming.

A

Stemming is a process by which affixes are removed and a word is converted to its root/base form. The stemmed word may or may not be meaningful, because stemming does not check whether the result is a real word. It just removes the affixes, hence it is faster.
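A toy suffix-stripping stemmer, written only to show the idea. It is not the Porter stemmer or any library's algorithm; the suffix list `SUFFIXES` and the minimum-length rule are assumptions of this sketch.

```python
# Suffixes are checked in this order; the first match is stripped.
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["studies", "studying", "played"]:
    print(w, "->", naive_stem(w))
```

Note that "studies" becomes "studi", which is not a real word: the stemmer only strips characters and never checks the result.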

12
Q

Explain the process of lemmatisation.

A

Lemmatisation is a process by which affixes are removed and a word is converted to its base/root form. The word after affix removal (the lemma) is always meaningful. Because lemmatisation makes sure that the lemma is a word with meaning, it takes longer to execute than stemming.
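One way to picture lemmatisation is as a dictionary lookup, which is why the result is always a real word. The table `LEMMA_TABLE` below is a tiny hypothetical stand-in for the vocabulary a real lemmatiser consults.

```python
# Hypothetical lookup table; a real lemmatiser uses a full dictionary.
LEMMA_TABLE = {
    "studies": "study",
    "studying": "study",
    "studied": "study",
}

def lemmatize(word):
    # Unknown words are returned unchanged.
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["studies", "Studying", "corpus"]:
    print(w, "->", lemmatize(w))
```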

13
Q

Difference between lemmatisation and stemming

A

Stemming
- the root word/stemmed word may or may not be meaningful
- takes less time to execute than lemmatisation
- studies - es = studi
- studying - ing = study

Lemmatisation
- the root word/lemma is always meaningful
- takes slightly longer to execute than stemming
- studies - es = study
- studying - ing = study

14
Q

Does the vocabulary of the corpus remain the same before and after text normalization? Give reasons.

A

No, it doesn’t. The process of text normalization reduces the corpus to the minimum vocabulary possible, as the machine doesn’t require grammatically correct sentences, only the essence of the corpus, to function.
In text normalization, stop words, special characters and numbers are removed.
In the processes of stemming and lemmatization, the affixes of words are removed and the word is converted to its base form.
Thus, the vocabulary after text normalization is decreased.
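The shrinking vocabulary can be checked on a small example. This sketch assumes a toy stopword set and, for brevity, lowers the case before filtering stopwords and skips stemming/lemmatisation; the names `normalise`, `raw_vocab` and `norm_vocab` are illustrative.

```python
import re

STOPWORDS = {"a", "an", "the", "and", "are", "in"}  # illustrative list

def normalise(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Drop numbers and special characters, then lower the case.
    tokens = [t.lower() for t in tokens if t.isalpha()]
    # Drop stopwords.
    return [t for t in tokens if t not in STOPWORDS]

text = "The cat sat. The Cat and the dog sat in 2 chairs!"
raw_vocab = set(re.findall(r"\w+|[^\w\s]", text))
norm_vocab = set(normalise(text))
print(len(raw_vocab), len(norm_vocab))
```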
