L6: NLP - Introduction to Natural Language Processing Flashcards

1
Q

TEXT DATA, TWITTER EXAMPLE

A

{
  "id": "1579502253039550466",
  "conversation_id": "1579238912039403521",
  "entities": {
    "mentions": [{"start": 0, "end": 14, "username": "SaucyforTesla", "id": "1334784052386119680"}],
    "urls": [{"start": 171, "end": 194, "url": "https://t.co/qrfOZy6wQe",
      "expanded_url": "https://twitter.com/messages/compose?recipient_id=10850192",
      "display_url": "twitter.com/messages/compo\u2026", "status": 404,
      "unwound_url": "https://twitter.com/messages/compose?recipient_id=10850192"}]
  },
  "author_id": "10850192",
  "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 0, "quote_count": 0},
  "text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
  "referenced_tweets": [{"type": "replied_to", "id": "1579238912039403521"}],
  "created_at": "2022-10-10T16:00:57.000Z",
  "edit_history_tweet_ids": ["1579502253039550466"],
  "full_text": "@SaucyforTesla Hi, Larry. Our team would be happy to discuss your ideas in more detail. When you have a moment, please feel free to DM us with additional details. -Rachel https://t.co/qrfOZy6wQe",
  "unixTime": 1665410457.0,
  "twitter_name": "GM"
}

2
Q

TEXT DATA, AMAZON REVIEW EXAMPLE

A

{
  'overall': 5.0,
  'verified': True,
  'reviewTime': '03 21, 2014',
  'reviewerID': 'A2HQAG6N6J6WF7',
  'asin': '6073894996',
  'reviewerName': 'Speed3',
  'reviewText': 'Plug it into your accessory outlet and charge your USB cabled device. I use it for charging my iPhone while I drive to and from work.',
  'summary': 'It works',
  'unixReviewTime': 1395360000,
}

3
Q

Corpora and Documents

A

Corpora are typically thought of as meaningfully cohesive collections of documents. Often a class of documents, e.g.:
* From the same source (e.g. Amazon reviews)
* From different sources, but on the same topic (e.g. different historical sources about WW2)
* Filtered by some set of criteria (e.g. tweets containing the word "ham", or 5-star reviews)

4
Q

Documents are uniform collections of

A
  • Words & sentences (potential features)
  • Auxiliary data, e.g. time, place, and other metainfo (potential features)
5
Q

DERIVING DATA FROM DOCUMENTS

A

From unstructured data to meaningful features.
Often we want more information than what is in the document text or metadata, e.g.:
* How many words/sentences are in it?
* How long ago was it written?
* How long is the text?
* Is the text generally happy or unhappy? What is the sentiment of the text?
This information can often be derived from the documents. We do this by applying functions to each document.
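
A minimal sketch of what applying a function to each document can look like in Python (the function name and the toy review are illustrative, not from the slides):

def extract_features(doc):
    # Derive simple features from one document's text
    text = doc['reviewText']
    words = text.split()
    sentences = [s for s in text.split('.') if s.strip()]
    return {'n_words': len(words), 'n_sentences': len(sentences), 'text_length': len(text)}

reviews = [{'reviewText': 'It works. I use it every day.'}]
print([extract_features(doc) for doc in reviews])
# [{'n_words': 7, 'n_sentences': 2, 'text_length': 29}]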

6
Q

TYPICAL FUNCTIONS WE USE TO EXPLORE TEXT DATA

A

How many words do they write?
How many sentences?
What is the reading difficulty?
Are people generally happy or unhappy?
Anything else anyone can think of?

7
Q

LIX SCORE

A

LIX is a useful measure of how difficult a text is to read. It measures:

How long the sentences are

How large a percentage of the words are longer than 6 characters

Formally:
LIX = word_count / sentence_count + (words_longer_than_6_chars * 100) / word_count
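
A minimal sketch of the formula in Python (naive sentence splitting on '.', '!' and '?'; assumes plain prose):

import re

def lix(text):
    # Words per sentence + percentage of words longer than 6 characters
    words = text.split()
    sentences = [s for s in re.split(r'[.!?]', text) if s.strip()]
    long_words = [w for w in words if len(w.strip('.,!?')) > 6]
    return len(words) / len(sentences) + len(long_words) * 100 / len(words)

print(lix('The cat sat on the mat. It was a remarkably comfortable mat.'))
# 12/2 + 2*100/12 = 22.67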

8
Q

SENTIMENT ANALYSIS

A

Sentiment Analysis is an evaluation of how positive or negative a text is.
The simple way to do it is to have two lists of words:
* Positive words
* Negative words
And then simply count how many of each there are in a text, and add them up.
Alternatively, some words are more negative or positive than others. If we give each of them a negativity score or a positivity score, we can do a more nuanced analysis of the sentiment.
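
A minimal sketch of the word-list approach (these tiny word lists are illustrative placeholders, not a real sentiment lexicon):

POSITIVE = {'good', 'great', 'happy', 'excellent'}
NEGATIVE = {'bad', 'awful', 'unhappy', 'terrible'}

def sentiment_score(text):
    # Positive count minus negative count; above 0 leans positive
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score('The food was great but the service was awful'))  # 0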

9
Q

FROM DESCRIPTION TO CLASSIFICATION (AND PREDICTION)

A

We have a variety of features now. How do we think of features as belonging to a class,
and therefore potentially helpful in terms of making predictions?

The data-related questions:
How do we articulate regularities/relevant features?
How do we spot them?
How do we test them?

The learning-related questions:
How do we make predictions based on "known" feature-class relationships?
To what extent does each feature predict belonging to a specific class?

10
Q

THE FUNDAMENTAL QUESTIONS IN
MACHINE LEARNING (INCL NLP):

1. What makes something something, in a way that sets it apart from all the other things it could be?
   * What makes a duck a duck, and not a goose?
   * What makes a positive review a positive review, and not a lukewarm review?
   * What makes a customer review negative about service, but positive about food, and not vice versa?
2. How can we use those particularities about the things we are looking at, and make them useful?
   * Either to better understand the things we are looking at,
   * And/or to make predictions about when something is indeed something, and not something else
A
11
Q

CORE CONCEPTS: FEATURES AND CLASSES

A

Features: specific (processed) data that define a thing both in its own right, and in contrast to other things.

Bird features:
* Length of beak
* Shape of beak
* Color of beak
* Body posture
* Color of feathers
* Size
* Etc.

Classes: collections of things with shared features that we want to think of as a "something".

Bird classes:
* Hen
* Duck
* Goose
* Etc.

12
Q

WHAT’S HARD ABOUT THAT?

A

Features and classes seem easy enough, right?
For birds, biologists have already done most of the work for us and defined the features and classes. They have structured the data for us.
When we work with unstructured text data, we often need to identify and define these ourselves, e.g. from raw data like this:

{
  'overall': 5.0,
  'verified': True,
  'reviewTime': '03 21, 2014',
  'reviewerID': 'A2HQAG6N6J6WF7',
  'asin': '6073894996',
  'reviewerName': 'Speed3',
  'reviewText': 'Plug it into your accessory outlet and charge your USB cabled device. I use it for charging my iPhone while I drive to and from work.',
  'summary': 'It works',
  'unixReviewTime': 1395360000,
}

13
Q

HOW?

A

The process of
1. Defining,
2. Identifying, and
3. Extracting
features and classes from your data, and structuring it yourself, is often the most difficult part of the process.
And it is what we will mostly talk about today.

14
Q

BAYES LAW

A

P(A | B) = P(B | A) P(A) / P(B)

P(A | B) = what is the probability that a document belongs to class A, given that feature B appears?

P(B | A) = how often do we see the feature in Class A?

P(A) = how often do we see Class A in the corpus in general?

P(B) = how frequent is the feature in general?
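
A worked toy example (the numbers are made up for illustration): suppose 60% of documents are positive reviews (class A), and the feature B is the word 'great' appearing, which happens in 30% of positive reviews and 5% of the rest.

p_a = 0.6                # P(A): positive reviews in the corpus
p_b_given_a = 0.3        # P(B | A): 'great' in a positive review
p_b_given_not_a = 0.05   # 'great' in a non-positive review

# P(B): overall frequency of the feature (law of total probability)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # 0.2

# Bayes law
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # 0.9: a review containing 'great' is very likely positive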

15
Q

USING BAYES LAW ON A DATASET

A
  1. We take our input, extract features
    from it.
  2. We add KNOWN class label(s) to
    those features.
  3. We tell our Bayesian classifier,
    “these features indicate that the
    item that produced the featureset
    belongs to class X”
  4. The machine updates the model,
    according to Bayes law.
    If we give it a new featureset, it makes
    predictions based on what it knows.

This is a SUPERVISED machine learning approach.
It is supervised because:
1. Classes are known in advance
2. We know which documents belong to which class
3. We tell the model how documents and features are related
We will do some unsupervised learning later.

EVALUATING A MODEL
A confusion matrix
Evaluation: precision, recall, accuracy
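
One common implementation of this workflow is NLTK's NaiveBayesClassifier; a minimal sketch (the feature names and labels are illustrative):

import nltk

# Featuresets with KNOWN class labels: "these features indicate class X"
train_set = [
    ({'contains_great': True,  'contains_awful': False}, 'pos'),
    ({'contains_great': False, 'contains_awful': True},  'neg'),
    ({'contains_great': True,  'contains_awful': False}, 'pos'),
    ({'contains_great': False, 'contains_awful': True},  'neg'),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Given a new featureset, it makes predictions based on what it knows
print(classifier.classify({'contains_great': True, 'contains_awful': False}))  # 'pos'
classifier.show_most_informative_features()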

16
Q

WEAKNESSES OF NAÏVE BAYES

A

So that worked pretty well! But what are the weaknesses of Bayes?
1. Only categorical classes
* i.e. we can't estimate likelihoods or continuous outcomes
2. It assumes that all features are independent
* We can "overload" on some features (e.g. a 'sentiment score' AND including 'great' or 'excellent' in our classifier)
3. It assumes that all features "weigh" the same
* We may believe that some features should be more important than others
4. It assumes that we already know what we are looking for!
* We provide it labels and data with those labels (we supervise it)
* Next we will look at more explorative, unsupervised classification

17
Q

BRIEF PRIMER: VECTORIZATION OF DOCUMENTS

A

Note: this is the old-fashioned approach. It works, but we have better approaches now.
Vectors are long lists of numbers. You may know them from high school math, e.g.
(4, 0, 1, 9, 0, 12, 4, …, n)
You definitely know them from math as "points" in a Cartesian space, e.g.
Point1 = (x1, y1, z1)
Point2 = (x2, y2, z2)
etc.

18
Q

FROM DOCUMENT TO VECTORS

A

The traditional way is to create a vector where each dimension corresponds to a word, and the value in each dimension is the number of times that word appears in a document.
E.g.
Small corpus: "I love dogs", "I hate dogs", "I love cats", "I play video games"
Words = I, love, hate, dogs, cats, play, video, games. Eight dimensions.
Dimensions = [I, cats, dogs, hate, love, play, video, games]
I love cats -> [1, 1, 0, 0, 1, 0, 0, 0]
I love dogs -> [1, 0, 1, 0, 1, 0, 0, 0]
I play video games -> [1, 0, 0, 0, 0, 1, 1, 1]
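
A minimal sketch of this count vectorization in Python, reusing the slide's dimension ordering:

CORPUS = ['I love dogs', 'I hate dogs', 'I love cats', 'I play video games']
DIMENSIONS = ['i', 'cats', 'dogs', 'hate', 'love', 'play', 'video', 'games']

def vectorize(sentence):
    # Count how often each dimension word occurs in the sentence
    words = sentence.lower().split()
    return [words.count(dim) for dim in DIMENSIONS]

for sentence in CORPUS:
    print(sentence, '->', vectorize(sentence))
# 'I love cats' -> [1, 1, 0, 0, 1, 0, 0, 0], etc.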

19
Q

BUT WHY VECTORIZE?

A

We can use points in space to find distances!
Who remembers Pythagoras?
How do we find the distance between two points, P1 and P2?
a^2 + b^2 = distance^2 <=> distance = sqrt(a^2 + b^2)
In more than two dimensions this generalizes to the Euclidean distance: sum the squared differences in every dimension, then take the square root.

20
Q

DISTANCES BETWEEN SENTENCES

A

I love cats -> [1, 1, 0, 0, 1, 0, 0, 0]
I love dogs -> [1, 0, 1, 0, 1, 0, 0, 0]
I play video games -> [1, 0, 0, 0, 0, 1, 1, 1]
What is the distance between each of these sentences? Which ones are most similar, just at face value?

21
Q

What is the distance between each of these sentences? Which ones are most similar, just at face value?

I love cats -> [1, 1, 0, 0, 1, 0, 0, 0]
I love dogs -> [1, 0, 1, 0, 1, 0, 0, 0]
I play video games -> [1, 0, 0, 0, 0, 1, 1, 1]

A

1-2: sqrt((1-1)^2 + (1-0)^2 + (0-1)^2 + (0-0)^2 + (1-1)^2 + (0-0)^2 + (0-0)^2 + (0-0)^2) = sqrt(2) ≈ 1.41
1-3: sqrt((1-1)^2 + (1-0)^2 + (0-0)^2 + (0-0)^2 + (1-0)^2 + (0-1)^2 + (0-1)^2 + (0-1)^2) = sqrt(5) ≈ 2.24
So 1 and 2 are closer to each other using this method. Just as we would expect. Nice!
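
A quick check of these numbers in Python, using the standard library's Euclidean distance:

import math

love_cats  = [1, 1, 0, 0, 1, 0, 0, 0]
love_dogs  = [1, 0, 1, 0, 1, 0, 0, 0]
play_games = [1, 0, 0, 0, 0, 1, 1, 1]

print(math.dist(love_cats, love_dogs))   # 1.414... = sqrt(2)
print(math.dist(love_cats, play_games))  # 2.236... = sqrt(5)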

22
Q

PROBLEMS WITH OLD SCHOOL VECTORIZATION

A
1. The vectors are ridiculously long, and most of their entries are 0.
   * For even a small corpus, we have tens or hundreds of thousands of words, most of which appear in just a few documents
2. Different words that mean the same thing count as completely different words:
   * e.g. "i love cats", "i hate cats", "i like cats" -> similar distances.
   * But "i like cats" and "i love cats" should be closer, because 'like' is closer to 'love' than to 'hate' (they are more semantically similar)
Can we vectorize language without having too many 0s, AND take into account semantic similarity?
23
Q

THESE WORD VECTORS (EMBEDDINGS)…

A

Show us which words are often used in the same contexts as other words.
Allow us to compare words at the individual level by measuring their cosine similarity, e.g. dog and cat will be similar:
- I take my X to the vet
- I have to go home and feed my X

24
Q

Measuring similarity

A

Given 2 target words v and w, we'll need a way to measure their similarity.

Most measures of vector similarity are based on the dot product (or inner product) from linear algebra:
* High when two vectors have large values in the same dimensions
* Low (in fact 0) for orthogonal vectors with zeros in complementary distribution
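
A minimal sketch of the dot product and the cosine similarity built on it (cosine is the dot product of length-normalized vectors):

import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def cosine_similarity(v, w):
    # 1 = same direction, 0 = orthogonal
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

print(cosine_similarity([1, 1, 0, 0, 1], [1, 0, 1, 0, 1]))  # 0.666... (similar)
print(cosine_similarity([1, 0], [0, 1]))                    # 0.0 (orthogonal)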

25
Q

GENSIM

A

A Python library for word embeddings.
Uses a neural net model (or deep learning) to learn to predict a word when given the surrounding words as input.
Contains a class called Word2Vec (word to vector), which
- takes a corpus,
- turns it into a set of word embeddings.
Contains a class called Doc2Vec (document to vector), which
- takes a corpus,
- turns it into a set of document embeddings.
(There is an R version as well.)
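
A minimal gensim sketch (the toy corpus is far too small to give meaningful embeddings; sg=1 selects skip-gram, sg=0 CBOW, see the next card):

from gensim.models import Word2Vec

# Each document is a list of tokens
corpus = [
    ['i', 'love', 'dogs'],
    ['i', 'hate', 'dogs'],
    ['i', 'love', 'cats'],
    ['i', 'play', 'video', 'games'],
]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv['cats'])                     # the embedding vector for 'cats'
print(model.wv.similarity('cats', 'dogs'))  # cosine similarity of two words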

26
Q

GENSIM - SKIP-GRAM VS CBOW

A

Skip-gram
* Pro: makes it possible for a word to have different meanings (e.g. Apple the company, apple the fruit)
* Con: but has a less strong signal

Continuous Bag of Words (CBOW)
* Pro: stronger signal
* Con: the meaning of an individual word will lie somewhere in between the different meanings it has (spatially/geometrically)

27
Q

WHAT IS THE END RESULT?

A

A model that has embedded the semantic similarity of words.
We can use this to:
1. Find the similarity between words and sentences
2. Classify existing text data by finding "clusters" of texts with high semantic similarity (unsupervised machine learning)
3. Give it new, unseen data, and classify that

28
Q

TOPIC MODELLING: FROM WORDS TO DOCUMENTS TO
TOPICS

A

If we think of each document as consisting of the sum of each of its word vectors, we can also compare documents.
Topic modelling looks for clusters of documents, i.e. the vectors that are closest to each other.
Topic modelling can help us
* Confirm our own ideas of what is in the text
* Discover new things that are in the text, that we did not know about before

29
Q

RUNNING A TOPIC MODEL

A

A topic model is given a set of documents and told how many (n) topics to cluster for.
We do NOT tell it what the topics are; it figures that out for us as well as it can.
Because we do not tell it in advance what the topics are, and because we do not teach it by showing it true examples of the topics, we call this unsupervised.
It returns n lists of words; each of these lists is a topic. The words are ordered in terms of their cohesiveness to the topic (i.e. how special is this word to this topic?).
A topic model gives you a topic cohesiveness score, which tells you:
* For each word, how special is this word to this topic?
* For each topic, the total cohesiveness of that topic
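
A minimal sketch of a topic model using gensim's LDA implementation (the toy corpus and n=2 are illustrative; real runs need many more documents):

from gensim import corpora
from gensim.models import LdaModel

docs = [
    ['dog', 'cat', 'vet', 'feed'],
    ['dog', 'vet', 'walk', 'feed'],
    ['game', 'play', 'video', 'console'],
    ['play', 'video', 'game', 'stream'],
]

dictionary = corpora.Dictionary(docs)                   # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

# n lists of words, each ordered by how strongly the word belongs to the topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)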

30
Q

WHAT DID WE LEARN ABOUT PRE-LLM NLP

A

Bayes Law can help us calculate:
- how well a set of pre-defined labels can predict a document’s class
- which features are most informative
- if we already have a ground truth

Vectorization of documents can help us:
- project documents into space
- this lets us calculate distances and similarities
- and lets us cluster documents into topics

Between these two methods, we can
* Better identify specific kinds of classes of text
* Describe which words or themes appear in all of them
* Classify or find similar documents, and see what makes them similar

We can evaluate our models in different ways:
- Face-value evaluation: does this seem reasonable at all?
- Bayes gives us a neat Confusion Matrix
- Clustering gives us topic coherence

This opens up a wealth of possibilities for exploring texts, and validating hypotheses
about the contents of HUGE corpora of text

31
Q

LARGE LANGUAGE MODELS IN NLP: How do we use LLMs for Natural Language Processing? The same things!

A
1. Feature extraction (the small things in text we might be interested in)
2. Classification (the overall text)
3. Vectorization
But just slightly differently…
32
Q

LLM VECTORIZATION

A

We start with LLM vectorization because we won't spend much time on it.
The vectors can be used for the same things as "old school" vectors:
1. Similarity measures
2. Clustering
3. Classification through clustering
They are just better at it, because their semantic representations are better.

33
Q

THE BRITTLENESS OF PROMPTING

A

Prompt sensitivity: slight variations in the phrasing of the prompt that would not make much difference to a human can lead to very different LLM output (Lu et al. 2021; Zhao et al. 2021).

Of 30 studied LLMs, "all models show significant sensitivity to the formatting of prompt, the particular choice of in-context examples, and the number of in-context examples across all scenarios and for all metrics." (Liang et al. 2022)

34
Q

EMERGENT PROPERTIES?

A

Emergent property/ability: A property that a model exhibits despite not being explicitly trained for it.

E.g. Chain of Thought prompting

35
Q

LLM FEATURE EXTRACTION

A

Unlike programming languages, LLMs work on natural language. When we want features extracted, we have to explain what we want in words.
When we do feature extraction with LLMs, we:
1. Tell the LLM what to look for
2. Tell the LLM how we want the answer formatted
3. Give the LLM the data we want features extracted from
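
A minimal sketch of those three steps with the OpenAI Python client (the model name and prompt wording are illustrative choices, not prescribed here):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

review = 'Plug it into your accessory outlet and charge your USB cabled device.'

response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        # 1. What to look for + 2. how we want the answer
        {'role': 'system', 'content': 'Extract all product-related nouns from the review. Answer with a JSON list of strings and nothing else.'},
        # 3. The data we want features extracted from
        {'role': 'user', 'content': review},
    ],
)
print(response.choices[0].message.content)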

36
Q

LLM CLASSIFICATION

A

LLMs can also classify text for us.
Instead of telling it what to extract, we tell it what it should classify based on.
In other words, LLMs can be used in the whole process:
1. Give it new text, and have it extract features
2. Feed those features back to the LLM as classes
3. Give it data, and ask it to classify those data
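
The same client, sketched as a classifier (again, the prompt wording is just one way to phrase it):

from openai import OpenAI

client = OpenAI()

review = 'It stopped charging after two days. Complete waste of money.'

response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        # Tell it what to classify based on, and constrain the answer
        {'role': 'system', 'content': "Classify the review as 'positive' or 'negative' based on its sentiment. Answer with one word."},
        {'role': 'user', 'content': review},
    ],
)
print(response.choices[0].message.content)  # most likely: negative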

37
Q

IPHONE REVIEWS: ACTIVITY
Download the data files containing 100 reviews that contain the word 'iphone' from last time.

Let’s try to answer the same questions we did last time:
1. What are the differences between positive and negative reviews?
2. What characterizes them?
3. Can we tell positive and negative reviews apart from one another?

A

Examples:
1. Adjectives, specific words
2. Numeric evaluations of content
3. Counts of occurrences
4. Overall classes

38
Q

TIPS FOR USING LLMS FOR DATA ANALYSIS

A

Words matter! If GPT is not doing what you want it to do, think of all the ways in which you can say what you are saying, just in a different way.

Practice makes perfect. Using LLMs is easy to get started with, but hard to get really good at. Building good habits takes time.

Use the playground first, then do the rest programmatically. The power of LLMs is working with datasets much larger than what we can feed into the model at one time. The only way to do that is through the power of coding.