Natural Language Processing Flashcards by Daniel Casley

What is Natural Language Processing?

The act of handling natural language using some form of computational model

What is text language similarity?

Detecting which language a text is written in from the words it contains

What is sentiment analysis?

Classifying the emotional content of a message

What is topic extraction?

Detecting the main topic(s) of a text

What is text summarisation?

Producing a shorter version of a text that preserves its key information

What is relationship extraction?

Extracting the relationships between entities (people, places, organisations and so on) mentioned in a text

What is question answering?

Automatically answering questions posed in natural language to an NLP system

What is language generation?

Generating natural-language text from some input, such as a question or prompt

What is Bayesian spam detection?

A method of estimating the probability that a message is spam, using Bayes' theorem

How is Bayesian spam detection calculated?

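Multiply together the probability of each word in the message given that the message is spam, multiply by the prior probability of spam, then divide by the probability of the message: P(spam | message) = P(word 1 | spam) x ... x P(word n | spam) x P(spam) / P(message)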
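
As a rough sketch of this calculation, the Python below applies Bayes' theorem with the usual "naive" assumption that words are independent given the class. The function name `spam_posterior`, the two probability tables and all the numbers in them are made up for illustration, and the probability of the message is assumed to be computed by summing over the spam and non-spam ("ham") cases.

```python
from math import prod

def spam_posterior(words, p_word_given_spam, p_word_given_ham, p_spam):
    """Naive Bayes estimate of P(spam | words)."""
    # Likelihood of the message under each class: product of per-word probabilities.
    # (A word missing from a table raises KeyError; real filters smooth for that.)
    like_spam = prod(p_word_given_spam[w] for w in words)
    like_ham = prod(p_word_given_ham[w] for w in words)
    # P(message), summed over the two classes (law of total probability).
    p_message = like_spam * p_spam + like_ham * (1 - p_spam)
    return like_spam * p_spam / p_message

# Hypothetical per-word probabilities, for illustration only.
P_WORD_GIVEN_SPAM = {"free": 0.5, "meeting": 0.1}
P_WORD_GIVEN_HAM = {"free": 0.1, "meeting": 0.5}

print(spam_posterior(["free"], P_WORD_GIVEN_SPAM, P_WORD_GIVEN_HAM, p_spam=0.5))
```

With these invented numbers, a message containing only ‘free’ comes out as more likely spam than not, while a message containing both ‘free’ and ‘meeting’ comes out at exactly 0.5.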

What is word variation?

The fact that two words can share the same root/lemma but differ in form or meaning

What is the lemma of a word?

The base (dictionary) form of that word - e.g. ‘speak’ and ‘speaks’ share the lemma ‘speak’

What is tokenisation?

Separating a sentence or paragraph into different tokens or words
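
A minimal sketch of tokenisation for English-like text, assuming that lowercasing and dropping punctuation is acceptable. The function name `tokenise` and the regular expression are illustrative choices, not a standard.

```python
import re

def tokenise(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # Runs of letters, digits or apostrophes count as one token.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenise("The cat sat. The cat slept!"))
```

This kind of rule relies on spaces and punctuation marking word boundaries, which is a reasonable assumption for English but not for every language.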

What is stemming?

Reducing a word to its root form by removing any prefixes and suffixes - e.g. ‘changing’ -> ‘chang’
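
A deliberately crude sketch of suffix stripping, to show why a stem such as ‘chang’ need not be a real word. The function `naive_stem` and its suffix list are hypothetical; established stemmers such as the Porter stemmer use many more rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("changing", "changed", "changes", "sing"):
    print(w, "->", naive_stem(w))
```

The three-character minimum is an arbitrary guard that stops short words like ‘sing’ from being cut down to a single letter.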

What is lemmatisation?

Reducing a word to its dictionary form (lemma) - e.g. ‘was’ -> ‘be’, ‘changing’ -> ‘change’
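
One way to picture lemmatisation is as a dictionary lookup rather than rule-based chopping. The tiny table `LEMMAS` below is hand-written and hypothetical; real lemmatisers rely on full dictionaries and usually on part-of-speech information as well.

```python
# Hand-written lookup table, for illustration only.
LEMMAS = {
    "was": "be",       # a form of the verb 'to be'
    "were": "be",
    "is": "be",
    "changing": "change",
    "speaks": "speak",
}

def lemmatise(word):
    """Return the lemma if the word is in the table, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

print(lemmatise("was"), lemmatise("changing"))
```

Unlike a stem, the result is always a real word, because it comes from a dictionary entry.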

Why may tokenisation be different in English compared to other languages?

Word boundaries do not always line up with units of meaning. Two words may be semantically linked - e.g. in Vietnamese, ‘thoi gian’ means a period of time, but separately they are unrelated (‘thoi’ means a shuttle or buffet and ‘gian’ means time). Or a single token may stand for multiple words - e.g. in Japanese, 姉 (‘ane’) means ‘older sister’ and 妹 (‘imōto’) means ‘younger sister’, but ‘sisters’ is a compound of the two: 姉妹 (pronounced ‘shimai’)

What is meant by a ‘bag of words’ representation?

An entire corpus/document is represented simply by the frequencies of its words (usually stemmed or lemmatised), ignoring word order
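
A minimal sketch of a bag of words using Python's standard `collections.Counter`. The token list is assumed to have been tokenised (and optionally stemmed or lemmatised) already, and the helper name `bag_of_words` is illustrative.

```python
from collections import Counter

def bag_of_words(tokens):
    """Count how often each token occurs; word order is discarded."""
    return Counter(tokens)

bag = bag_of_words(["the", "cat", "sat", "the", "cat", "slept"])
print(bag)

# Two token lists with the same words in a different order give the same bag.
print(bag_of_words(["cat", "the"]) == bag_of_words(["the", "cat"]))
```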

What is meant by a ‘corpus’ in relation to word processing?

All of the text being used, or a document in its entirety

How do word frequency distributions typically look?

Like a long tail (roughly Zipf's law): a few very common words such as connectives at the front, and many rare, more descriptive words in the tail

To represent data among multiple documents, we could use…

A document-word frequency matrix
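
A small sketch of building such a matrix from already-tokenised documents, with one row per document and one column per vocabulary word. The function name `document_term_matrix` and the two toy documents are assumptions for illustration.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({token for doc in docs for token in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = [["free", "offer", "free"], ["meeting", "offer"]]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny example some cells are zero, because not every word appears in every document.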

We can measure how much a word relates to a type of document by…

Dividing its frequency in that document type by its frequency in other document types
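
The ratio can be sketched as below. The function name `frequency_ratio` is hypothetical, and the `smoothing` term is an added assumption (not part of the card) that avoids dividing by zero when a word never appears in the other document types.

```python
def frequency_ratio(word, target_docs, other_docs, smoothing=1.0):
    """Relative frequency of word in target_docs divided by that in other_docs."""
    def rate(docs):
        total = sum(len(doc) for doc in docs)
        count = sum(doc.count(word) for doc in docs)
        # Smoothing is an illustrative choice to keep the rate above zero.
        return (count + smoothing) / (total + smoothing)
    return rate(target_docs) / rate(other_docs)

spam_docs = [["free", "offer", "free"]]
other_docs = [["meeting", "offer"]]
print(frequency_ratio("free", spam_docs, other_docs))
print(frequency_ratio("offer", spam_docs, other_docs))
```

A ratio above 1 suggests the word is more typical of the target document type, and a ratio below 1 suggests it is more typical of the others.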

What is one of the main problems with ‘bag of words’ representations?

The matrix generated is typically too sparse - there are too many zeroes

What is a mathematical topic model used for?

Discovering the different topics in a collection of documents

What is Latent Dirichlet Allocation (LDA)?

A probabilistic topic model that, in effect, separates a document-term matrix into two smaller matrices - document-topic and topic-term - a bit like splitting a table in 2nd normal form!

What does Latent Dirichlet Allocation (LDA) achieve?

The separation of words into topics, grouping them together to reduce dimensionality

What is one problem with topic modelling?

It is not great for generative modelling, and does not work well for short texts with few topics to discuss

What is word embedding?

Representing every word as a dense vector of continuous values - e.g. values that capture how likely the word is to appear in a certain context

How can we perform word embedding?

Create one-hot vectors representing each word and its context, then train an autoencoder-style network on them and take its hidden-layer weights as the embedding values
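
The one-hot step can be sketched as follows. This only shows the input encoding, not the network training, and the vocabulary and function name `one_hot` are assumptions for illustration.

```python
def one_hot(word, vocab):
    """Return a list with a 1 at the word's position in vocab and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

vocab = ["cat", "sat", "the"]
print(one_hot("sat", vocab))
```

Each vector is as long as the vocabulary and almost entirely zeros, which is why a network is then used to compress it into a much shorter dense vector.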

What can we do to solve the issues posed by single-word encoding (predicting only one word ahead)?

Predict several surrounding context words for each input word instead of just one; this is called a skip-gram model
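
A sketch of how skip-gram training pairs might be generated, assuming a symmetric context window on both sides of each word. The function name `skipgram_pairs` and the default window size are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each centre word with every word within `window` positions of it."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], window=1))
```

Each (centre, context) pair becomes one training example in which the network sees the centre word and is asked to predict the context word.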

Embedded vectors translate semantic relationships to...

Vector relationships
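
To make the idea concrete, the sketch below uses hand-made two-dimensional vectors (not learned embeddings; every number is invented) in which one axis loosely stands for royalty and the other for gender. It then looks for the word closest to ‘king’ - ‘man’ + ‘woman’ by cosine similarity.

```python
import math

# Invented toy vectors: (royalty, gender). Not real embeddings.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.2),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))
```

With these toy values the closest remaining word is ‘queen’; with real embeddings the same arithmetic only holds approximately.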

What is meant by polysemy?

A word having multiple meanings - e.g. ‘queen’ can be the ruler, the playing card or the band (Queen)

What are sense embeddings?

Separate embeddings for each sense of a word, which help tell its meanings apart - e.g. "Queen" -> "Queen, the band" and "queen" -> "queen, the playing card"

What is one way we can adjust our word embeddings for corpora (plural of corpus) with larger scales?

By using sentence embeddings, or even document embeddings, which do the same sort of job but for larger units of text

(34 cards)