Text Analysis Flashcards Preview

Flashcards in Text Analysis Deck (20)
1

Numbers and strings that are stored as columns in relational databases or dataframes

Structured data

2

Data in its raw form that doesn’t fit neatly into a database, such as text and multimedia content

Unstructured data

3

What is text analysis

Converting textual data into a structured format suitable for analysis

4

Challenges of text analysis

Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode the rules of language, etc.

5

Bag of words approach

Each document (bag) is a collection of tokens (words)
The order of words is ignored
Long strings are split into smaller pieces or “tokens”
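
A minimal sketch of the bag of words idea, assuming a simple regex tokenizer and Python's `collections.Counter` as the "bag". The function name `bag_of_words` and the example sentences are illustrative, not part of the deck:

```python
import re
from collections import Counter

def bag_of_words(document):
    """Split a document into lowercase tokens and count them, ignoring order."""
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

bag_a = bag_of_words("The dog chased the cat")
bag_b = bag_of_words("The cat chased the dog")

print(bag_a)
# The sentences mean different things, but word order is discarded,
# so both documents produce the same bag.
print(bag_a == bag_b)
```

The two sentences differ in meaning yet map to identical bags, which is exactly the information this approach gives up.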

6

What is a token

A meaningful unit of text, most often a word, that we are interested in using for further analysis

7

What is tokenization

The process of splitting text into tokens
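
As a rough illustration, tokenization into words can be sketched with a regular expression. The pattern below is one assumed choice (lowercase words, keeping contractions together); real tokenizers handle many more cases:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, keeping contractions like "isn't" whole."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

tokens = tokenize("Text analysis isn't hard, is it?")
print(tokens)
```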

8

What are stop words

Commonly occurring words (a, an, and, after, by, why, your, we, etc.) that are not informative about the document
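
A small sketch of stop word removal, assuming the text is already tokenized. The stop word set below contains only the examples from this card; real stop word lists are much longer:

```python
# Stop words taken from the examples on this card (not a complete list).
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in STOP_WORDS]

tokens = ["we", "shipped", "your", "order", "after", "a", "delay"]
filtered = remove_stop_words(tokens)
print(filtered)
```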

9

What is stemming

Reducing words to their “stem”, base, or root form so that families of related words with similar meanings can be treated as a single unit
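
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is a toy, not the Porter stemmer or any library's algorithm; the suffix list and the minimum stem length of 3 are assumptions chosen for the example:

```python
# Toy suffix list, checked in this order (an assumption for illustration only).
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, as long as at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["connect", "connected", "connecting", "connects"]
stems = [naive_stem(word) for word in words]
print(stems)
```

All four forms collapse to the same stem, so later counts treat them as one token. A rule this crude will mis-stem many English words, which is why production stemmers use far more careful rules.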

10

Ways of describing structured data

Summary statistics (mean, variance, etc.)
Visualizations such as scatter plots

11

3 ways to describe text

Word count analysis
Word count chart
Word cloud

12

What is word count analysis

Table with tokens in descending order of frequency
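
A word count table can be sketched with `collections.Counter`, whose `most_common()` method returns tokens in descending order of frequency. The token list is made-up example data:

```python
from collections import Counter

# Made-up tokens, e.g. from restaurant reviews after stop word removal.
tokens = ["great", "food", "great", "service", "slow", "service", "great"]

# most_common() sorts (token, count) pairs by count, highest first.
word_counts = Counter(tokens).most_common()

for token, count in word_counts:
    print(f"{token:<10}{count}")
```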

13

What is a word count chart

Bar chart showing frequency of top N tokens
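
As a sketch, the same idea can be drawn as a plain-text bar chart with no plotting library. The helper name `top_n_chart` and the data are hypothetical; a real chart would typically come from a plotting package:

```python
from collections import Counter

def top_n_chart(tokens, n):
    """Return one text bar per token for the n most frequent tokens."""
    lines = []
    for token, count in Counter(tokens).most_common(n):
        lines.append(f"{token:<8} {'#' * count} {count}")
    return lines

tokens = ["great", "food", "great", "service", "slow", "service", "great"]
chart = top_n_chart(tokens, 2)
print("\n".join(chart))
```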

14

What is a word cloud

Visual representation of frequency (or importance) of words in a corpus

15

What is sentiment analysis

Aka opinion mining
The computational study of opinions, sentiments, and emotions expressed in text

16

3 built-in dictionaries (lexicons) of tidytext that give sentiment, emotions, etc., for words

AFINN from Finn Årup Nielsen
Bing from Bing Liu and collaborators
NRC from Saif Mohammad and Peter Turney
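
A minimal sketch of dictionary-based sentiment scoring in the style of a numeric lexicon such as AFINN, which assigns each word an integer score. The words and scores below are invented for illustration and are not taken from AFINN, Bing, or NRC:

```python
# Toy lexicon: these scores are made up, not real AFINN values.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3}

def sentiment_score(tokens):
    """Sum the lexicon scores of the tokens; words not in the lexicon count as 0."""
    return sum(TOY_LEXICON.get(token, 0) for token in tokens)

review = "great food but bad service".split()
score = sentiment_score(review)
print(score)  # 3 (great) + -2 (bad) = 1
```

A positive total suggests an overall positive review. Because this is still a bag of words method, it ignores negation and context ("not good" would score as positive).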

17

Goal of topic modeling

To discover the latent topics or factors from a large number of text documents

18

Latent Dirichlet Allocation (LDA)

An unsupervised learning method similar to cluster analysis (where we discover latent groups or clusters)
-used to discover latent topics in documents, for example latent quality dimensions from reviews

19

This is the most common algorithm for topic modeling

Latent Dirichlet Allocation (LDA)

20

Two guiding principles of LDA

Every document is a mixture of topics
Every topic is a mixture of words
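
The two principles can be combined in a short sketch: the probability of seeing a word in a document is the topic-weighted sum of each topic's word probabilities. All topic names and numbers below are hypothetical, chosen by hand rather than estimated by LDA:

```python
# Principle 1: a document is a mixture of topics (weights sum to 1).
doc_topics = {"food": 0.7, "service": 0.3}

# Principle 2: each topic is a mixture of words (each row sums to 1).
topic_words = {
    "food": {"tasty": 0.6, "fresh": 0.3, "waiter": 0.1},
    "service": {"tasty": 0.1, "fresh": 0.1, "waiter": 0.8},
}

def word_probability(word):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

probs = {word: word_probability(word) for word in ["tasty", "fresh", "waiter"]}
print(probs)
```

LDA works in the opposite direction: it observes only the words in each document and estimates both mixtures from them.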