Lecture 11 Flashcards
Text as data (29 cards)
When does text become text-data?
Text becomes text-data when it is collected for use, analysis, or recording
Examples of text structure
*plain sentences
*list of words
*structured table (one row per speech)
*tags/labels
Heterogeneous Structure
The way we store text data depends on how we plan to analyze it
Qualitative Coding of Texts
It is when researchers define a set of categories (e.g., policy topic, sentiment) and assign documents or passages to those categories by hand
*also called “manual content analysis”
Quantitative Text Analysis
Translating text into numbers (e.g., word frequency)
Examples of Quantitative Text Analysis
*Text classification methods
*Scaling methods
*Text Similarity
*Text Reuse Method
Why does text as data exist?
The volume and accessibility of text data have increased, making it easier to use text for quantitative analysis
What sort of questions can we answer?
*Descriptive
*Inferential
Descriptive Text Analysis
To describe the content of a text in a clear, factual way by summarizing or categorizing what is in the text
e.g., count how many times a politician says “security”, “border”, “asylum”
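A minimal sketch of the keyword count described above, using only the Python standard library. The speech text is invented for illustration, and the simple regex tokenizer is an assumption, not a method prescribed by the lecture.

```python
import re
from collections import Counter

# Hypothetical speech text, invented for illustration only.
speech = (
    "Security at the border matters. Border policy and asylum policy "
    "must protect security."
)

keywords = ["security", "border", "asylum"]

# Lowercase and split into word tokens so "Security" and "security" match.
tokens = re.findall(r"[a-z']+", speech.lower())
counts = Counter(tokens)

# Keep only the counts for the keywords of interest.
keyword_counts = {word: counts[word] for word in keywords}
print(keyword_counts)
```

The result only describes what is in the text; it makes no claim about why the speaker used those words.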
Inferential Text Analysis
Relies on text to draw broader generalizable conclusions about political actors, society, hidden attitudes
*e.g., use speeches to infer a politician’s ideology
What is a corpus in text analysis?
A structured collection of texts in a machine-readable format used for analysis
5 steps to quantitative text analysis
- decide research question/objective
- acquire documents
- create a corpus or dataset (machine-readable format)
- pre-process text
- perform analysis
What does a Document-Term Matrix (DTM) represent?
A table where rows = documents, columns = terms, and cells = word frequency in each document.
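A small sketch of how a DTM could be built by hand, assuming whitespace tokenization and a two-document corpus invented for illustration. Dedicated text-analysis libraries provide their own DTM builders; this only shows the shape of the table.

```python
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
docs = {
    "speech_1": "border security border",
    "speech_2": "asylum policy security",
}

# Count terms within each document.
doc_counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary (all distinct terms, sorted) becomes the columns.
vocab = sorted({term for counts in doc_counts.values() for term in counts})

# One row per document, one column per term, cell = term frequency.
dtm = [[doc_counts[name][term] for term in vocab] for name in docs]

print(vocab)
for name, row in zip(docs, dtm):
    print(name, row)
```

Most cells in a real DTM are zero, because any single document uses only a small share of the corpus vocabulary.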
Why do we pre-process text data?
To reduce noise (irrelevant or meaningless information) and prepare the text for analysis
What are common pre-processing steps in text analysis?
*remove stop words (and, the, in)
*standardize terms (US and USA in same format)
*lowercase words
*lemmatization (reduce a word to its base dictionary form), e.g. “running” to “run”
note: which steps to take depends on the research objective
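The steps above can be sketched as a single function. The stop-word set, the standardization map, and the lemma lookup are tiny hand-made stand-ins (assumptions for illustration); a real project would use fuller lists or a proper lemmatizer.

```python
import re

# Tiny illustrative resources, not complete lists.
STOP_WORDS = {"and", "the", "in"}
STANDARD_FORMS = {"usa": "us"}             # map variants to one form
LEMMAS = {"running": "run", "ran": "run"}  # hand-made lookup, not a real lemmatizer


def preprocess(text):
    """Lowercase, tokenize, standardize terms, drop stop words, lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("The candidate is running in the USA and the US"))
```

Note that lowercasing turns "US" into "us", which can no longer be told apart from the pronoun. That is one reason the choice of steps depends on the research objective.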
What is a “text data frame” under tidy data principles?
A table where each row represents a unit of observation such as a document, sentence, or paragraph
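A rough sketch of a text data frame with one row per sentence, represented here as a list of dictionaries. The documents, the column names (`doc_id`, `sentence_id`, `text`), and the naive sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical documents, invented for illustration.
documents = {
    "speech_1": "We will secure the border. We will reform asylum.",
    "speech_2": "Security comes first.",
}

rows = []
for doc_id, text in documents.items():
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for i, sentence in enumerate(sentences, start=1):
        rows.append({"doc_id": doc_id, "sentence_id": i, "text": sentence})

for row in rows:
    print(row)
```

Changing the unit of observation (document, paragraph, sentence) only changes how the text is split before the rows are built.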
Two common methods of quantitative text analysis
- Dictionary Analysis
- Topic Modelling
What is dictionary analysis in text classification?
An automated method that classifies text based on a predefined list of keywords (dictionary). The dictionary can be:
1. pre-existing
2. custom made
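A minimal sketch of dictionary-based classification with a custom dictionary. The categories and keywords are hypothetical, and "pick the category with the most keyword hits" is one simple scoring rule assumed here, not the only option.

```python
import re

# Hypothetical custom dictionary: category -> keywords (invented for illustration).
DICTIONARY = {
    "immigration": {"border", "asylum", "migrant"},
    "economy": {"tax", "jobs", "inflation"},
}


def classify(text):
    """Return the category with the most keyword hits, or None if there are none."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        category: sum(token in keywords for token in tokens)
        for category, keywords in DICTIONARY.items()
    }
    # On a tie, max() returns the category listed first in DICTIONARY.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify("Asylum claims at the border rose while jobs stayed flat."))
```

The classifier is only as valid as its keyword lists, which is why the choice between a pre-existing and a custom dictionary matters.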
What are the pros and cons of pre-existing dictionaries?
Easy to use, but may lack validity in different contexts (e.g., movie reviews ≠ political speeches)
What’s the benefit of creating custom dictionaries?
Higher validity since it’s tailored to your specific texts, but more time-consuming.
What is topic modeling?
A method where the computer discovers topics in the text based on word patterns—no predefined categories.
In topic modeling, who assigns meaning to the topics?
The researcher, based on the keywords the computer groups together.
What is text similarity and how is it measured?
It measures how similar documents are, often using cosine similarity based on word usage
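A short sketch of cosine similarity on raw word counts, assuming lowercase whitespace tokenization. The function name is chosen for illustration; the formula is the standard dot product divided by the product of the vector lengths.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms both texts share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("border security", "border policy"))
```

With word counts, the score ranges from 0 (no shared words) to 1 (identical word proportions), regardless of document length.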
What is text reuse?
Identifying identical or copied chunks of text across documents; used in plagiarism detection and in tracking the flow of information.
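One simple way to detect reuse is to compare overlapping word sequences (n-grams, sometimes called shingles) between two documents. The sketch below assumes 3-word shingles and whitespace tokenization; both texts are invented for illustration, and real text-reuse tools use more robust alignment.

```python
def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_passages(text_a, text_b, n=3):
    """Return the n-word sequences that appear in both texts."""
    return shingles(text_a, n) & shingles(text_b, n)


# Hypothetical source text and a text that reuses part of it.
original = "we will secure the border and reform asylum"
report = "the minister said we will secure the border next year"

for passage in sorted(shared_passages(original, report)):
    print(" ".join(passage))
```

A larger `n` makes matches stricter, so fewer coincidental overlaps are flagged as reuse.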