Text Analysis Flashcards Preview

Flashcards in Text Analysis Deck (20)

Numbers and strings that are stored as columns in relational databases or dataframes

Structured data


Data kept as is, such as text and multimedia content, that doesn’t fit neatly into the rows and columns of a database

Unstructured data


What is text analysis

Converting textual data into a structured format suitable for analysis


Challenges of text analysis

Numerous languages
Variations in usage, grammar, dialects, etc.
Hard to understand context, resolve ambiguity, formally encode the rules of language, etc.


Bag of words approach

Each document (bag) is a collection of tokens (words)
The order of words is ignored
Long strings are split into smaller pieces or “tokens”
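
Example (a minimal sketch, not a standard implementation): the bag-of-words idea can be illustrated with Python's `collections.Counter`. The helper name `bag_of_words` and the two sample sentences are assumptions made up for this card, and splitting on whitespace is the simplest possible tokenizer.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Split on whitespace and count tokens; word order is discarded."""
    tokens = document.lower().split()
    return Counter(tokens)

# Two made-up documents with the same words in a different order
doc_a = "the service was good but the food was bad"
doc_b = "the food was good but the service was bad"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a["the"])    # number of times "the" occurs in doc_a
print(bag_a == bag_b)  # True when both bags hold the same token counts
```

Because order is ignored, the two sentences produce identical bags even though they say opposite things about the food and the service, which is the main trade-off of this approach.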


What is a token

A meaningful unit of text, most often a word, that we are interested in using for further analysis


What is tokenization

The process of splitting text into tokens
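
Example (a rough sketch, assuming English text and a simple regular expression): one way to tokenize is to lowercase the text and keep runs of letters, digits, and apostrophes. The function name `tokenize` and the pattern are illustrative choices, not the rule used by any particular library.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Text analysis isn't hard, it's just messy!")
print(tokens)
```

Punctuation such as commas and exclamation marks is dropped, while contractions like "isn't" stay together as one token.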


What are stop words

Commonly occurring words such as:
A, an, and, after, by, why, your, we, etc.
that are not informative about the document
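
Example (a minimal sketch): stop words are usually removed by filtering tokens against a fixed list. The set below contains only the words named on this card; real stop word lists are much longer, and the helper name `remove_stop_words` and the sample tokens are assumptions for illustration.

```python
# Only the stop words listed on this card; real lists are far longer
STOP_WORDS = {"a", "an", "and", "after", "by", "why", "your", "we"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["we", "ordered", "a", "pizza", "and", "your", "driver", "arrived", "late"]
kept = remove_stop_words(tokens)
print(kept)
```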


What is stemming

Families of related words with similar meanings can be considered as a single unit by reducing words to their “stem”, base, or root form
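
Example (a deliberately crude sketch, not a real stemming algorithm): stripping a few common English suffixes is enough to show how a word family collapses to one stem. The suffix list and the name `crude_stem` are assumptions for illustration; established stemmers such as the Porter algorithm apply many more rules.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def crude_stem(word: str) -> str:
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

family = ["connect", "connected", "connecting", "connects"]
stems = {crude_stem(w) for w in family}
print(stems)  # the whole family reduces to a single stem
```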


Ways of describing structured data

Summary statistics (mean, variance, etc.)
Visualizations such as scatter plots


3 ways to describe text

Word count analysis
Word count chart
Word cloud


What is word count analysis

Table with tokens in descending order of frequency


What is word count chart

Bar chart showing frequency of top N tokens
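
Example (a minimal sketch with made-up tokens): a word count table and a text-only version of a word count chart can both be built from `collections.Counter`. The token list, the `TOP_N` value, and the `#` bars are assumptions for illustration; a real chart would normally use a plotting library.

```python
from collections import Counter

# Made-up tokens, assumed to be already tokenized and cleaned
tokens = ["pizza", "late", "pizza", "cold", "late", "pizza", "driver"]
counts = Counter(tokens)

# Word count analysis: tokens in descending order of frequency
table = counts.most_common()

# Word count chart: one text bar per token for the top N tokens
TOP_N = 3
chart_lines = [f"{token:<8}{'#' * n} {n}" for token, n in counts.most_common(TOP_N)]

for token, n in table:
    print(token, n)
print("\n".join(chart_lines))
```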


What is word cloud

Visual representation of frequency (or importance) of words in a corpus


What is sentiment analysis

Aka opinion mining
The computational study of opinions, sentiments, and emotions expressed in text


3 built-in dictionaries (lexicons) of tidytext that give sentiment, emotions, etc., for words

AFINN from Finn Årup Nielsen
Bing from Bing Liu and collaborators
NRC from Saif Mohammad and Peter Turney
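
Example (a toy sketch of dictionary-based scoring): AFINN assigns each word an integer score, so a simple document score is the sum of the scores of its words. The lexicon below is hypothetical and its values are invented for illustration; they are not taken from AFINN, Bing, or NRC. The name `sentiment_score` is likewise an assumption.

```python
# Hypothetical scores for illustration only; NOT taken from AFINN
TOY_LEXICON = {"good": 3, "great": 4, "bad": -3, "terrible": -4, "late": -1}

def sentiment_score(tokens, lexicon=TOY_LEXICON):
    """Sum the scores of tokens found in the lexicon; unknown words count as 0."""
    return sum(lexicon.get(t.lower(), 0) for t in tokens)

review = "the pizza was great but the delivery was late".split()
score = sentiment_score(review)
print(score)  # positive total means the review leans positive
```

Bing and NRC label words with categories rather than numbers, so with those lexicons one would count words per category instead of summing scores.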


Goal of topic modeling

To discover the latent topics or factors from a large number of text documents


Latent Dirichlet Allocation (LDA)

An unsupervised learning method, similar to cluster analysis (where we discover latent groups or clusters)
Used, for example, to discover latent quality dimensions from reviews


This is the most common algorithm for topic modeling

Latent Dirichlet Allocation (LDA)


Two guiding principles of LDA

Every document is a mixture of topics
Every topic is a mixture of words
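
Example (a sketch of the two principles only, not of how LDA is fitted): if a document is a mixture of topics and each topic is a mixture of words, then the probability of a word in a document is the topic-weighted sum of the word's probability in each topic. All topic names and probabilities below are hypothetical, chosen by hand rather than estimated from data.

```python
# Hypothetical probabilities, chosen by hand for illustration (not fitted by LDA)
# Every topic is a mixture of words
topic_words = {
    "food":     {"pizza": 0.6, "cold": 0.3, "late": 0.1},
    "delivery": {"pizza": 0.1, "cold": 0.2, "late": 0.7},
}
# Every document is a mixture of topics
doc_topics = {"food": 0.25, "delivery": 0.75}

def word_probability(word, doc_topics, topic_words):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

p_late = word_probability("late", doc_topics, topic_words)
print(round(p_late, 3))
```

LDA works in the opposite direction: it starts from the observed words in many documents and estimates both mixtures.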