Natural Language Processing Flashcards
(34 cards)
What is Natural Language Processing?
The act of handling natural language using some form of computational model
What is text language similarity?
Identifying which language a text is written in from the words it contains
What is sentiment analysis?
Classifying the emotional content of a message
What is topic extraction?
Identifying the topic(s) that a text is about
What is text summarisation?
Producing a shorter version of a text that keeps its key points
What is relationship extraction?
Extracting the relationships between objects in text
What is question answering?
Automatically answering questions posed to an NLP system in natural language
What is language generation?
Generating natural-language text in response to some input, such as a question
What is Bayesian spam detection?
A method of estimating the probability that a message is spam, using Bayes' theorem and the words the message contains
How is Bayesian spam detection calculated?
The product of the probabilities of each word in the message given that the message is spam, multiplied by the prior probability of spam, divided by the probability of the message: P(spam | message) = P(spam) × Π P(word | spam) / P(message)
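The calculation above can be sketched as follows. This is a minimal illustration under the naive assumption that words are independent; the per-word probabilities, the prior, and the names `P_WORD_GIVEN_SPAM`, `P_WORD_GIVEN_HAM` and `spam_probability` are all made up for the example. P(message) is obtained by summing the spam and non-spam numerators.

```python
from math import prod

# Hypothetical per-word likelihoods; real values would be estimated from labelled mail.
P_WORD_GIVEN_SPAM = {"free": 0.30, "offer": 0.20, "meeting": 0.01}
P_WORD_GIVEN_HAM = {"free": 0.02, "offer": 0.03, "meeting": 0.20}
P_SPAM = 0.4  # assumed prior probability that any message is spam


def spam_probability(words):
    """Return P(spam | words) under the naive independence assumption."""
    # Numerator: product of P(word | spam) over the words, times the prior P(spam).
    spam_score = prod(P_WORD_GIVEN_SPAM[w] for w in words) * P_SPAM
    # Same quantity for the non-spam ("ham") class.
    ham_score = prod(P_WORD_GIVEN_HAM[w] for w in words) * (1 - P_SPAM)
    # Denominator: P(message), summed over both classes.
    return spam_score / (spam_score + ham_score)
```

With these assumed numbers, a message containing 'free' and 'offer' scores close to 1, while a message containing only 'meeting' scores close to 0.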
What is word variation?
The fact that words can share the same root/lemma but appear in different surface forms (and sometimes carry different meanings)
What is the lemma of a word?
The ‘root’ of that word - e.g. ‘speak’ and ‘speaks’ have the same lemma
What is tokenisation?
Separating a sentence or paragraph into different tokens or words
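A minimal sketch of tokenisation for English-like text, assuming that tokens are runs of letters, digits and apostrophes. The function name `tokenise` and the regular expression are illustrative choices, not a standard.

```python
import re


def tokenise(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    # Each token is a run of letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenise("The cat sat. The cat slept!")
```

This approach relies on spaces and punctuation marking word boundaries, which does not hold for every language.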
What is stemming?
Reducing a word to its stem by stripping affixes (usually suffixes) with simple rules, even if the result is not a real word - e.g. ‘changing’ -> ‘chang’
What is lemmatisation?
Reducing a word to its lemma (its dictionary form) using knowledge of vocabulary and grammar - e.g. ‘was’ -> ‘to be’, ‘changing’ -> ‘change’
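The difference between stemming and lemmatisation can be sketched with two toy functions. Both are deliberately simplistic assumptions: `stem` strips one suffix from a short hand-picked list, and `lemmatise` looks words up in a tiny hand-written table, whereas real tools use much larger rule sets and dictionaries.

```python
SUFFIXES = ("ing", "es", "s")  # toy suffix list, checked in this order

# Tiny hand-written lemma table; 'be' stands for the infinitive 'to be'.
LEMMAS = {"was": "be", "changing": "change", "speaks": "speak"}


def stem(word):
    """Strip the first matching suffix, even if the result is not a real word."""
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatise(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)
```

Here `stem("changing")` gives 'chang', which is not a word, while `lemmatise("changing")` gives 'change'. The stemmer leaves 'was' untouched, but the lemmatiser maps it to 'be'.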
Why may tokenisation be different in English compared to other languages?
Several space-separated words may form a single unit of meaning - e.g. in Vietnamese, ‘thoi gian’ means a period of time, but taken separately the two words are unrelated (‘thoi’ means a shuttle or buffet and ‘gian’ means time). Conversely, a single unbroken token may correspond to several words - e.g. in Japanese, 姉 (or ‘ane’) means ‘older sister’ and 妹 (or ‘imōto’) means ‘younger sister’, while ‘sisters’ is a compound of the two: 姉妹 (pronounced ‘shimai’)
What is meant by a ‘bag of words’ representation?
An entire corpus/document is represented simply by the frequencies of its (stemmed or lemmatised) words, ignoring word order and grammar
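A minimal sketch of a bag of words, assuming lowercased, whitespace-separated tokens and skipping the stemming or lemmatisation step for brevity. The function name `bag_of_words` is illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


bag = bag_of_words("the cat saw the dog")
```

Because word order is discarded, 'dog bites man' and 'man bites dog' produce exactly the same bag.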
What is meant by a ‘corpus’ in relation to word processing?
All of the text being used, or a document in its entirety
How do word frequency distributions typically look?
Like a long tail (roughly following Zipf's law), with a few very frequent function words such as connectives at the front and many rare, descriptive words at the end
To represent data among multiple documents, we could use…
A document-word frequency matrix
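A document-word frequency matrix can be sketched as a list of rows, one per document, with one column per vocabulary word. The function name `document_term_matrix` and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # The vocabulary is every distinct word across all documents, sorted for stable columns.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words that a document does not contain.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(["the cat sat", "the dog sat down"])
```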
We can measure how much a word relates to a type of document by…
Dividing its frequency in that document type by its frequency in other document types
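A minimal sketch of this ratio. Two details are assumptions added for the example rather than part of the card: frequencies are taken relative to the number of tokens so that document types of different sizes are comparable, and add-one smoothing is applied so that a word missing from the other document types does not cause a division by zero. The function name `frequency_ratio` is illustrative.

```python
def frequency_ratio(word, target_tokens, other_tokens, smoothing=1):
    """Ratio of a word's smoothed relative frequency in one document type to that in the rest."""
    target_rate = (target_tokens.count(word) + smoothing) / (len(target_tokens) + smoothing)
    other_rate = (other_tokens.count(word) + smoothing) / (len(other_tokens) + smoothing)
    return target_rate / other_rate


# Hypothetical token lists for two document types.
spam_tokens = ["free", "free", "offer", "now"]
other_tokens = ["meeting", "now", "free", "agenda"]
ratio = frequency_ratio("free", spam_tokens, other_tokens)
```

A ratio above 1 suggests the word is more typical of the target document type, and a ratio below 1 suggests the opposite.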
What is one of the main problems with ‘bag of words’ representations?
The matrix generated is typically too sparse - there are too many zeroes
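Sparsity can be measured as the fraction of zero entries. The sketch below assumes the matrix is a non-empty list of rows; the function name `sparsity` and the example matrix are illustrative.

```python
def sparsity(matrix):
    """Return the fraction of entries in a list-of-lists matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)


# Hypothetical counts for two short documents over a five-word vocabulary.
example = [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
fraction_zero = sparsity(example)
```

In a realistic corpus the vocabulary is far larger than any single document, so this fraction is typically much closer to 1.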
What is a mathematical topic model used for?
Discovering the different topics in a collection of documents
What is Latent Dirichlet Allocation (LDA)?
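A probabilistic topic model that treats each document as a mixture of topics and each topic as a distribution over terms. Informally, it separates a document-term matrix into two matrices - document-topic and topic-term - like in 2nd normal form!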
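The two-matrix view can be illustrated by multiplying a document-topic matrix by a topic-term matrix to recover each document's expected term probabilities. This sketch does not perform LDA inference; the matrices are hypothetical values chosen by hand, and the function name `expected_term_distribution` is illustrative.

```python
# Hypothetical values: 2 documents, 2 topics, 3 vocabulary terms.
doc_topic = [[0.9, 0.1],   # document 0 is mostly topic 0
             [0.2, 0.8]]   # document 1 is mostly topic 1
topic_term = [[0.7, 0.2, 0.1],   # topic 0 favours term 0
              [0.1, 0.3, 0.6]]   # topic 1 favours term 2


def expected_term_distribution(doc_topic, topic_term):
    """Multiply the two matrices to get each document's expected term probabilities."""
    n_topics = len(topic_term)
    n_terms = len(topic_term[0])
    return [
        [sum(row[k] * topic_term[k][j] for k in range(n_topics)) for j in range(n_terms)]
        for row in doc_topic
    ]


doc_term = expected_term_distribution(doc_topic, topic_term)
```

Each row of `doc_term` is a probability distribution over terms, so it sums to 1. An actual LDA implementation works in the other direction, estimating the two matrices from observed word counts.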