L16 - Topic Modelling Flashcards

1
Q

Define the field of Text Mining…

A

The process of extracting useful, structured information from unstructured text data.

2
Q

What type of modelling approach is Topic Modelling?

A

A statistical modelling approach that uses unsupervised machine learning.

3
Q

Give a high level explanation of what Topic Modelling does. Give an example…

A

Analyses unstructured text data and groups it into clusters based on the words it contains; each cluster is treated as a topic.

E.g.: Analysing a text identifies the topics within it, which lets the model predict the purpose of the text, such as an invoice, a rock song, a literature review or a spam email.

4
Q

What is the input and output of Topic Modelling?

A

Input: a corpus (collection of texts), represented as a bag of words.

Output: topics, i.e. clusters of words, which are used to make predictions.
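
A minimal sketch of the input side, assuming a tiny made-up corpus and whitespace tokenisation (both are illustrative choices, not part of the original card). It turns each text into a bag of words and stacks the counts into a term-document matrix:

```python
from collections import Counter

# Hypothetical toy corpus, made up for illustration only.
corpus = [
    "please pay the invoice by friday",
    "the invoice total is overdue",
    "rock guitar solo and drum solo",
]

def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())

bags = [bag_of_words(doc) for doc in corpus]
vocabulary = sorted(set(word for bag in bags for word in bag))

# Term-document matrix: one row per word, one column per document.
term_document = [[bag[word] for bag in bags] for word in vocabulary]
```

Word order is discarded on purpose: only the counts per document are passed on to the topic model.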

5
Q

What are the 2 topic modelling techniques called?

A

Latent Semantic Analysis (LSA)
Latent Dirichlet Allocation (LDA)

6
Q

What is Latent Semantic Analysis? Give the step by step process

A
  • A technique to establish the relationship between documents and the words they contain.
  • Based on the assumption that words with similar meanings will appear in similar documents.
  1. Generate a word x document matrix where each row is a word, each column is a document, and each cell is the count of that word in that document.
  2. Perform Singular Value Decomposition on the matrix to reduce dimensionality whilst retaining important features.
  3. Use Cosine Similarity on the reduced document vectors to establish document similarities.
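
The three steps above can be sketched roughly as follows, assuming NumPy is available. The corpus, the choice of `k = 2` latent dimensions, and the helper name `cosine` are illustrative assumptions rather than anything prescribed by the card:

```python
import numpy as np

# Hypothetical toy corpus, made up for illustration only.
docs = [
    "invoice payment due invoice",
    "payment overdue invoice",
    "guitar drum solo guitar",
    "drum solo band",
]

# Step 1: word x document count matrix (rows = words, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Step 2: SVD of the whole matrix, keeping only the k largest singular values.
k = 2  # assumed number of latent dimensions
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document

# Step 3: cosine similarity between the reduced document vectors.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_invoice = cosine(doc_vectors[0], doc_vectors[1])  # two billing documents
sim_mixed = cosine(doc_vectors[0], doc_vectors[2])    # billing vs. music
```

In this toy corpus the two billing documents share no words with the two music documents, so the reduced vectors of the billing pair should point in nearly the same direction while the billing and music vectors should be nearly orthogonal.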
7
Q

Explain Latent Dirichlet Allocation…

A

A generative statistical model that assumes each document is a mixture of topics and each topic is a distribution over words, so the words in a document allow its topics to be deduced.

Maps a document to a list of relevant topics.
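
To make "generative" concrete, here is a minimal sketch of the story LDA assumes about how a document is produced. The two topics, their word probabilities, the `alpha` values and the function names are all hypothetical. Fitting an LDA model runs this story in reverse to infer the topics from observed documents, and that inference step is not shown here:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    "billing": {"invoice": 0.5, "payment": 0.3, "refund": 0.2},
    "music": {"guitar": 0.4, "drum": 0.4, "solo": 0.2},
}

def sample_dirichlet(alpha):
    """Draw topic proportions from a Dirichlet via normalised Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document following the LDA generative story."""
    names = list(topics)
    theta = sample_dirichlet(alpha)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=theta)[0]  # pick a topic for this word
        dist = topics[topic]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return theta, words

theta, words = generate_document(8)
```

Here `theta` is the list of topic proportions for the document, which corresponds to the "list of relevant topics" a fitted model reports.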

8
Q

What is Cosine Similarity?

A

Vector-based method for finding document similarity. If the cosine of the angle between 2 document vectors is close to 1, the documents are similar.
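
A small sketch of the calculation, using made-up word-count vectors over a shared vocabulary. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for this example:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors over the same vocabulary.
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same direction as doc_a, just longer
doc_c = [0, 0, 3, 1]  # shares no words with doc_a
```

Because only the direction of the vectors matters, `doc_a` and `doc_b` count as maximally similar even though `doc_b` is twice as long, while `doc_a` and `doc_c` share no words and score zero.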

9
Q

Give an example of Topic Modelling in use…

A

Customer Service Tickets - Based on the content of a customer's query, topic modelling can allocate the ticket to the correct team and tag it appropriately.

10
Q

What are the advantages and disadvantages of Topic Modelling?

A

Advantages:
- Simple input (term-document matrix)
- Quick and simple topic breakdown by percentage.

Disadvantages:
- Prone to overfitting
- Ineffective on short texts

11
Q

What is the main con of topic modelling? What is the solution to this?

A

Ineffective on shorter texts

Solution: Word Embedding
