NLP-5 Flashcards

1
Q

Define:
i) stop words
ii) frequent words
iii) rare/valuable words

A

i) Words that occur most often in the corpus but add negligible value to it are called stop words.

ii) Words that occur with moderate frequency in the corpus and carry some value are called frequent words. They mostly describe what the corpus is about.

iii) Words that occur least often in the corpus but add the most value to it are called rare (or valuable) words.

2
Q

What is the TF-IDF score?

A

TF-IDF stands for Term Frequency-Inverse Document Frequency. It helps us identify the value and statistical importance of each word: it is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

3
Q

What is term frequency?

A

Term frequency is the frequency of a word in one document. It can be read directly from
the document vector table, because that table records the frequency of each vocabulary word
in each document.
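
A document vector table like the one described above can be sketched in a few lines of Python. The three-document corpus, the whitespace tokenization, and the names `build_table`, `vocabulary`, and `table` are assumptions made for illustration, not part of the original card.

```python
from collections import Counter

# Hypothetical corpus, invented for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

def build_table(documents):
    """Return (vocabulary, table): one row per document, one column per word."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    table = []
    for doc in tokenized:
        counts = Counter(doc)  # raw term frequencies for this document
        table.append([counts[word] for word in vocabulary])
    return vocabulary, table

vocabulary, table = build_table(docs)

# Term frequency of "the" in the first document, read from the table.
tf_the_doc0 = table[0][vocabulary.index("the")]
print(tf_the_doc0)
```

Each row of `table` is the document vector for one document, so looking up a term frequency is just an index into that row.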

4
Q

What is TF? (school)

A

Term Frequency (TF)

TF is a measure of how frequently a word, t, appears in a document, d.
Refer to the formula.
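
The formula this card points to is not included here. As an assumption, one widely used definition is the length-normalized count, where the symbol f_{t,d} (introduced for this sketch) is the number of times t occurs in d. Some courses use the raw count f_{t,d} on its own instead.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
```

For example, if a word appears 2 times in a 6-word document, this definition gives TF = 2/6 ≈ 0.33.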

5
Q

What is IDF?

A

IDF measures the importance of a word in a corpus.

It is calculated as follows.

To understand inverse document frequency, we first need to understand document frequency. Document frequency is the number of documents in which the word occurs, irrespective of how many times it occurs in each of them. For inverse document frequency, the document frequency goes in the denominator and the total number of documents in the numerator.
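
The ratio described above can be sketched as follows. The corpus and the function names `document_frequency` and `idf` are hypothetical, and the base-10 logarithm shown at the end is a common convention assumed here rather than something stated on the card.

```python
import math

# Hypothetical corpus, invented for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

# Use sets so repeated occurrences inside one document count only once.
doc_word_sets = [set(doc.split()) for doc in docs]
total_docs = len(doc_word_sets)

def document_frequency(word):
    """Number of documents that contain the word at least once."""
    return sum(1 for words in doc_word_sets if word in words)

def idf(word):
    """Total documents over document frequency.

    Assumes the word occurs in at least one document; otherwise this
    divides by zero.
    """
    return total_docs / document_frequency(word)

print(document_frequency("the"), idf("the"))  # "the" is in every document
print(document_frequency("cat"), idf("cat"))
print(math.log10(idf("the")))                 # log-scaled variant (assumed)
```

A word that appears in every document gets a ratio of 1, and its logarithm is 0, which is why log scaling pushes such words to the bottom.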

6
Q

Summarise the importance of the TF-IDF score

A

After computing the TF and IDF scores separately, the TF-IDF score is calculated for each word by multiplying its term frequency by its inverse document frequency (usually log-scaled).

The TF-IDF score helps the computer gauge the importance of words when processing natural language.

The higher the value, the more valuable the word is for a given corpus.
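
Written out, the combination of the two scores looks like the expression below. This is a sketch under assumptions: the symbols N (total number of documents) and DF(W) (document frequency of W) follow the IDF card, and the logarithm is the usual convention rather than something this card states.

```latex
\mathrm{TFIDF}(W) = \mathrm{TF}(W) \times \log\!\left(\frac{N}{\mathrm{DF}(W)}\right)
```

A word found in every document has N / DF(W) = 1, so its score is 0 regardless of how often it occurs.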

7
Q

Summarise the TF-IDF score

A

1. Words that occur in all the documents with high term frequencies have the lowest values and
are considered to be stop words.
2. For a word to have a high TF-IDF value, it needs a high term frequency but a low
document frequency, which shows that the word is important for one document but is not a
common word across all documents.
3. These values help the computer understand which words to consider while
processing natural language. The higher the value, the more important the word is for a
given corpus.
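
The first two points can be checked with a small sketch. It assumes raw counts for TF and a base-10 log of the ratio N / DF for IDF. The corpus and the names `tfidf`, `tokenized`, and `doc_freq` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical corpus, invented for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang the song",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: count each word once per document it appears in.
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tfidf(word, doc_index):
    """Raw-count TF times log10(N / DF). Assumes the word occurs in the corpus."""
    tf = tokenized[doc_index].count(word)
    return tf * math.log10(n_docs / doc_freq[word])

print(tfidf("the", 0))  # in every document: the score collapses to zero
print(tfidf("cat", 0))  # in two of three documents: small score
print(tfidf("mat", 0))  # in one document only: highest of the three
```

Although "the" has the highest term frequency in the first document, it scores lowest because it appears in every document, while "mat" scores highest because it is unique to one.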

8
Q

Applications of TF-IDF

A

TF-IDF is commonly used in the Natural Language Processing domain.

Document classification - helps in classifying the type and genre of a document.

Topic modelling - helps in predicting the topic of a given corpus.

Information retrieval - helps in extracting the important information from a corpus.

Stop word removal - helps in removing unnecessary words from a given corpus.
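
As one illustration of the information retrieval use case, the sketch below ranks documents against a query by summing the TF-IDF scores of the query words. The corpus, the scoring rule, and the names `score` and `rank` are assumptions for illustration, and real retrieval systems use more refined weighting.

```python
import math
from collections import Counter

# Hypothetical corpus, invented for this sketch.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def score(query, doc_tokens):
    """Sum of raw-count TF times log10(N / DF) over the query words."""
    total = 0.0
    for word in query.split():
        if word in doc_freq:  # skip query words never seen in the corpus
            total += doc_tokens.count(word) * math.log10(n_docs / doc_freq[word])
    return total

def rank(query):
    """Document indices, best match first; ties keep the original order."""
    scored = [(score(query, doc), index) for index, doc in enumerate(tokenized)]
    return [index for _, index in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

print(rank("the dog"))
```

The word "the" contributes nothing to any score here because it appears in every document, so the ranking is decided by "dog" alone. This is also why TF-IDF can be used to spot stop words.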
