09. Text Analytics Flashcards

1
Q

What is Lemmatisation

A

Lemmatisation reduces inflected words to their correct dictionary base (root) form, which is always a valid word in the language.
In lemmatisation, the root word is called the lemma.
For example, runs, running and ran are all forms of the word run, so run is the lemma of all these words.
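
The lookup idea can be sketched with a toy dictionary. This is a minimal illustration only: the `LEMMAS` table and the `lemmatise` helper are hypothetical names, and a real lemmatiser relies on a full dictionary of the language.

```python
# Toy lemma dictionary: hypothetical and far from complete.
LEMMAS = {"runs": "run", "running": "run", "ran": "run"}

def lemmatise(word):
    """Return the dictionary base form of a word, or the lower-cased word if unknown."""
    key = word.lower()
    return LEMMAS.get(key, key)

forms = ["runs", "running", "ran"]
lemmas = [lemmatise(w) for w in forms]
print(lemmas)  # ['run', 'run', 'run']
```

Note that the irregular form ran is handled only because the dictionary lists it, which is what separates lemmatisation from plain suffix stripping.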

2
Q

What is stemming

A

Stemming is the process of reducing inflected words to their root forms, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.
The stem (root) is the part of the word to which affixes are added, such as the suffixes -ed, -ize and -s or the prefixes de- and mis-.
Stems are created by removing the suffixes or prefixes used with a word, which is called suffix/prefix stripping.

The Porter stemmer is one widely used stemming algorithm.
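
Suffix stripping can be sketched as follows. This is a deliberately naive illustration, not the Porter algorithm: the `SUFFIXES` list and the `strip_suffix` helper are assumptions made for the example.

```python
# Illustrative suffix list, checked in this order (not the Porter rule set).
SUFFIXES = ("ing", "ed", "es", "s")

def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [strip_suffix(w) for w in ["argued", "argues", "arguing"]]
print(stems)  # ['argu', 'argu', 'argu']
```

All three words map to the same stem, argu, which is not a valid word. An irregular form such as ran is left unchanged because no suffix rule applies to it.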

3
Q

What is the dimensionality in text analysis

A

It is the number of unique terms (the vocabulary size) in the text being analysed. Methods such as stemming, case folding and stop-word removal reduce this dimensionality to make the analysis simpler.

4
Q

What is case folding

A

It means ignoring the difference between upper-case and lower-case letters, typically by converting all text to lower case.
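
A minimal sketch of case folding using Python's built-in `str.casefold`; the word list is made up for illustration. It also shows how folding reduces the number of unique terms.

```python
words = ["Apple", "APPLE", "apple"]

# casefold() is a slightly more aggressive variant of lower() intended for comparisons.
folded = [w.casefold() for w in words]

unique_before = len(set(words))   # three distinct spellings
unique_after = len(set(folded))   # one term after folding
print(unique_before, unique_after)  # 3 1
```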

5
Q

What does tokenizing do

A

Tokenization (also called tokenizing) is the task of separating words from the body of text. It converts raw text into a collection of tokens, where each token is generally a word.

You need to define how the text is broken apart, e.g. on whitespace or punctuation.
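
One possible splitting rule, sketched with the standard `re` module. The `tokenize` helper and its pattern are assumptions for illustration: here any run of letters, digits or apostrophes counts as a token, and everything else is a separator.

```python
import re

def tokenize(text):
    """Split text into word tokens, treating whitespace and punctuation as separators."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("The cat sat. The cat purred!")
print(tokens)  # ['The', 'cat', 'sat', 'The', 'cat', 'purred']
```

A different rule (for example, keeping hyphenated words together) would give a different token list for the same text.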

6
Q

What does parsing mean

A

Parsing: reading unstructured text and converting it into structured, formatted data.

7
Q

In text analysis what is meant by search and retrieval

A

Search and retrieval: searching the documents in a corpus for specific words or phrases, topics, or entities such as the names of people and organisations.

8
Q

What is text mining

A

Text mining: applying analysis methods to discover relationships and patterns in large text collections.

9
Q

What is topic modelling

A

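Topic modelling discovers the topics present in a corpus. A topic consists of a cluster of words that frequently occur together and share the same theme, e.g. fluffy, meow, purr, paw = the topic of cats. Topic models are usually unsupervised, so the topics are inferred from the corpus itself rather than taken from pre-labelled data.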

10
Q

What does RSS mean

A

Really Simple Syndication

11
Q

What are Regular Expressions

A

A notation for defining the text patterns to search for or match in text mining, e.g. $ is the symbol used to indicate the end of a text string.

12
Q

What is Zipf’s Law

A

An empirical rule that holds only approximately: the frequency of the i-th most common word is proportional to 1/i.

Relative to the top-ranked word: 1st ranked = 1/1, 2nd ranked = 1/2, 3rd ranked = 1/3
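
The 1/rank relationship can be turned into expected counts. This is a small sketch: `zipf_expected` is a hypothetical helper, and the top count of 1200 is made-up data.

```python
def zipf_expected(top_count, max_rank):
    """Expected counts for ranks 1..max_rank if frequency is proportional to 1/rank."""
    return [top_count / rank for rank in range(1, max_rank + 1)]

# If the most common word appears 1200 times, Zipf's Law predicts roughly:
expected = zipf_expected(1200, 4)
print(expected)  # [1200.0, 600.0, 400.0, 300.0]
```

Real corpora follow this only roughly, so observed counts should be expected to deviate from these values.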

13
Q

What is case folding

A

It ignores the distinction between upper-case and lower-case letters in the text.

14
Q

What is information content of words

A

“Stop” words (e.g. the, and) carry almost no information content, so they are usually removed to improve text analysis.
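
Stop-word removal can be sketched as a simple filter. The `STOP_WORDS` set below is a tiny illustrative list, not a standard one, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop-word list; real lists contain many more entries.
STOP_WORDS = {"the", "and", "a", "of", "on"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

kept = remove_stop_words(["The", "cat", "and", "the", "dog", "sat", "on", "a", "mat"])
print(kept)  # ['cat', 'dog', 'sat', 'mat']
```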

15
Q

What is TF

A

Term frequency
TF1(t,d) = SUM over i = 1..n of f(t, t_i), where t_1..t_n are the tokens of document d and f(t, t_i) = 1 if t = t_i, otherwise 0
It is a count of the number of times that term t appears in document d
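
A minimal sketch of the count, reading TF(t,d) as the number of times term t occurs in a single document d (the usual interpretation of the two arguments). The `term_frequency` helper and the sample document are assumptions for illustration.

```python
def term_frequency(term, document_tokens):
    """Count how many tokens in one document are equal to the term."""
    return sum(1 for token in document_tokens if token == term)

doc = ["the", "cat", "sat", "on", "the", "mat"]
tf_the = term_frequency("the", doc)
tf_dog = term_frequency("dog", doc)
print(tf_the, tf_dog)  # 2 0
```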

16
Q

What is the IDF

A

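Inverse Document Frequency

The document frequency DF(t) is the number of documents in the corpus that contain the term t.

The inverse document frequency is based on the inverse of that, usually with a logarithm applied: IDF(t) = log(N / DF(t)), where N is the number of documents in the corpus.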
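
A sketch of both quantities, assuming the common logarithmic form IDF(t) = log(N / DF(t)) with N the number of documents. The helper names and the three-document corpus are made up for illustration.

```python
import math

def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term in doc)

def inverse_document_frequency(term, corpus):
    """log(N / DF(t)); assumes the term occurs in at least one document."""
    df = document_frequency(term, corpus)
    if df == 0:
        raise ValueError("term does not occur in the corpus")
    return math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "purrs"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
]
idf_the = inverse_document_frequency("the", corpus)  # log(3/3) = 0.0
idf_dog = inverse_document_frequency("dog", corpus)  # log(3/1)
```

A term that appears in every document gets an IDF of 0, while a rarer term gets a larger value.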

17
Q

What is the TFIDF

A

TFIDF(t,d) = TF(t,d) x IDF(t)

The higher the score, the better. A high number means that the term is important to document d: it occurs often in d but in few other documents in the corpus.
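
The product can be sketched end to end. This assumes a raw count for TF and log(N / DF) for IDF, which is one common choice among several; the function names and the toy corpus are hypothetical.

```python
import math

def tf(term, doc):
    """Raw count of the term in one document."""
    return doc.count(term)

def idf(term, corpus):
    """log(N / DF); only called here for terms that occur in the corpus."""
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["the", "cat", "purrs"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
]
# Score every term of the second document.
scores = {term: tfidf(term, corpus[1], corpus) for term in corpus[1]}
```

In this toy corpus, the scores 0 because it appears in every document, while dog and barks score highest for the second document.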

18
Q

What is sentiment analysis

A

Identifying the opinions expressed in text. It often uses classifiers (e.g. naive Bayes) and often has a binary result (positive or negative).

19
Q

What is a word cloud

A

An image of the words found in a document, with more frequent words drawn larger. Stop words are removed first.

20
Q

What is part of speech tagging (POS)

A

Labelling each word with its corresponding part of speech (noun, verb, adjective, etc.)

21
Q

List some common regular expressions

A

| means or
* matches zero or more instances of the previous character
+ matches one or more instances of the previous character
{2,4} matches two to four instances of the previous character
^ means starts with
$ means ends with
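
These operators can be tried out with Python's standard `re` module. The sample patterns and strings are made up for illustration; `re.fullmatch` requires the whole string to match, while `re.search` looks anywhere in the string.

```python
import re

checks = {
    # | means or
    "or": bool(re.fullmatch(r"cat|dog", "dog")),
    # * allows zero or more of the previous character
    "star_zero": bool(re.fullmatch(r"ab*", "a")),
    "star_many": bool(re.fullmatch(r"ab*", "abbb")),
    # + needs at least one of the previous character
    "plus_zero": bool(re.fullmatch(r"ab+", "a")),
    # {2,4} allows two to four of the previous character
    "range_three": bool(re.fullmatch(r"a{2,4}", "aaa")),
    "range_five": bool(re.fullmatch(r"a{2,4}", "aaaaa")),
    # ^ anchors at the start, $ anchors at the end
    "starts_with": bool(re.search(r"^The", "The cat sat")),
    "ends_with": bool(re.search(r"sat$", "The cat sat")),
}
print(checks)
```

Only `plus_zero` and `range_five` come out False: "a" has no b for + to match, and five a's exceed the upper bound of {2,4}.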

22
Q

What does bag of words mean

A

A representation of a text as the collection of all its words (usually with their counts), in which the order of the words is not preserved.
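
A minimal sketch using `collections.Counter` from the standard library; the two sentences are made up to show what is lost when word order is dropped.

```python
from collections import Counter

bag_a = Counter("the dog bit the man".split())
bag_b = Counter("the man bit the dog".split())

# Same words and counts, different order: the two bags are identical.
same = bag_a == bag_b
print(bag_a["the"], same)  # 2 True
```

The two sentences mean different things, but their bags of words cannot tell them apart.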