Introduction - Week 1 Flashcards

1
Q

What makes an application a language processing application?

A

It requires the use of knowledge about human language

2
Q

Is Unix wc an example of a language processing application?

A

Yes, when it counts words.
No, when it counts lines or bytes: lines and bytes are computer artefacts, not linguistic entities.

3
Q

Is Google Search an NLP application?

A

Yes, it uses knowledge about human languages

4
Q

Why is NLP hard?

A

Text is structured only for the human user; for the machine it is often almost fully unstructured, or at best ‘semi-structured’, like HTML

5
Q

semi-structured text

A

Text that is partially structured for the machine, such as HTML

6
Q

Natural Language Processing (NLP)

A

The steps necessary for “understanding” a piece of data represented in a (human) language

7
Q

NLP tasks (umbrella terms)

A
- Text mining
- Text analytics
- Computational Linguistics
- (human) language technology

8
Q

NLP Tasks and Applications

A

Information Retrieval
- Searching for relevant documents

Document classification
- Sorting documents into categories

Question answering
- Short answer for a question

Text summarisation
- Summarise a set of documents

Sentiment analysis
- Product reviews, Twitter, hate crime detection

Machine translation
- One of the first motivations for NLP

Natural language generation
- For data-to-text, i.e. generating text from structured data

Authoring and marking tools
- Check spelling, grammar, style
- Automated marking of essays

Conversational Agents
- Dialogue, voice recognition, text-to-speech, speech-to-text

etc. (many, many others)

9
Q

NLP main problems

A

Variability
Ambiguity

10
Q

Variability

A

Numerous ways to say the same thing

11
Q

Ambiguity

A

Words and sentences are often ambiguous, and can have multiple meanings

12
Q

Word-level ambiguity

A

Apple (company) or Apple (fruit)

13
Q

Sentence-level ambiguity

A

I made her duck (at least 5 meanings: I cooked a duck for her; I cooked the duck she owns; I made the (sculpted) duck she owns; I caused her to lower her head; I turned her into a duck)

14
Q

Lexical Ambiguity

A

A word with multiple possible POS tags, e.g. “duck” can be a verb or a noun
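
In practice a POS tagger must resolve this ambiguity from context, assigning one tag per occurrence. A minimal sketch with NLTK (assuming the library and its tokeniser/tagger models are installed; the exact tags depend on the pretrained tagger):

import nltk

# One-off model downloads (uncomment on first use):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

for sentence in ["They duck behind the wall.", "The duck swam away."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))

# "duck" should come out as a verb (e.g. VBP) in the first sentence
# and as a noun (NN) in the second.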

15
Q

Lexical-semantic ambiguity

A

A word with different senses
e.g. bank can be financial institution or part of countryside (river bank)

16
Q

Syntactic Ambiguity

A

Ambiguity arising from possible word groupings

17
Q

Parts of Speech (POS)

A

nouns, verbs, adjectives, adverbs, pronouns, prepositions, auxiliaries, determiners

18
Q

Open class words

A

nouns, verbs, adjectives, adverbs

19
Q

Closed class words

A

pronouns, prepositions, auxiliaries, determiners

20
Q

Attachment Ambiguity

A

I saw the girl with the telescope
(did I use the telescope to see the girl, or does the girl have the telescope?)

21
Q

Coordination ambiguity

A

Old men and women (are the women also old?)

Mother and baby in pram (is the mother in the pram?)

22
Q

Local Ambiguity

A

Police help dog bite victim (are the police helping a dog-bite victim, or helping a dog to bite a victim?)

23
Q

Corpus

A

a (large) collection of linguistic data
- May consist of written texts, spoken discourse, samples of spoken or written language

24
Q

unannotated corpus

A

raw text/speech

25
Q

annotated (labelled) corpus

A

raw text/speech enhanced with linguistic information

A repository of explicit linguistic information (added manually or automatically)
e.g. specifying that “loves” in “Mary loves John” is the 3rd person singular present-tense form of a verb

26
Q

Corpus annotation types

A

Grammatical (e.g. POS tags, noun / verb phrases)

Semantic (e.g. person, drug)

Pragmatics - language in use (e.g. conversation)

Combined

27
Q

Why do we need annotated corpora?

A

For training and evaluation

Training:
- Train linguists, language learners, etc…
- Researchers (e.g. in language development)
- NLP development: use ML/statistics to learn patterns from an annotated corpus

NLP evaluation:
- Compare NLP results (automated “annotations”) with a manually coded “gold standard”

28
Q

Do people agree on annotations?

A

No: annotation can be very subjective and inconsistent, making it very difficult to get a gold standard corpus.

Some simple tasks are relatively consistent (e.g. “what’s the name of the lecture?”); others are inconsistent (e.g. sentiment analysis)

29
Q

Annotation agreement

A

Kappa measures the agreement between two classifiers who classify N items into C mutually exclusive categories.

k = (Pr(a) - Pr(e)) / (1 - Pr(e))

Pr(a) = relative observed agreement among the annotators

Pr(e) = hypothetical probability of chance agreement, typically estimated from the observed data as the probability of each annotator choosing each category at random

If the annotators are in complete agreement then k = 1; if there is no agreement beyond what would be expected by chance (as defined by Pr(e)), then k = 0.
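
A minimal sketch of the calculation in Python (the two annotators’ label lists below are made-up illustration data):

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same N items."""
    n = len(labels_a)

    # Pr(a): relative observed agreement among the annotators
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Pr(e): chance agreement, from each annotator's observed category frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pr_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())

    return (pr_a - pr_e) / (1 - pr_e)

# Two annotators label ten items as pos/neg; they agree on 7 of the 10:
a = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg", "neg"]
print(cohens_kappa(a, b))  # 0.4, since Pr(a) = 0.7 and Pr(e) = 0.5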

30
Q

Precision

A

Fraction of retrieved documents that are relevant

relevant items retrieved / retrieved items

P(relevant|retrieved)

31
Q

Recall

A

Fraction of relevant documents that are retrieved

relevant items retrieved / relevant items

P(retrieved | relevant)

32
Q

F-measure

A

Weighted harmonic mean of Precision and Recall, trading the two off against each other. With equal weights:

F = 2PR / (P + R)
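
All three measures (cards 30-32) can be sketched in a few lines of Python over sets of document IDs; the sets below are made-up illustration data:

def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and (evenly weighted) F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                 # relevant items retrieved
    p = len(hits) / len(retrieved)              # P(relevant | retrieved)
    r = len(hits) / len(relevant)               # P(retrieved | relevant)
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of P and R
    return p, r, f

# A search returns documents {1, 2, 3, 4}; the truly relevant ones are {2, 4, 5}:
print(precision_recall_f1({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.667, 0.571)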

33
Q

Cross-validation

A

Break the data up into n folds
- Ideally stratified, i.e. with similar proportions of positive and negative examples in each fold

For each fold:
- Use that fold as a temporary test set
- Train on the other n-1 folds and compute performance on the test fold

Report the average performance over the n runs (see the sketch below)

A sensible value for n is often 10
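
A minimal sketch of the procedure in Python. train_and_score stands for any hypothetical train-then-evaluate routine; the majority-class baseline below is only there to make the example runnable, and this version does not stratify the folds:

import random

def cross_validate(data, labels, train_and_score, n=10, seed=0):
    """n-fold cross-validation: each fold is used exactly once as the test set."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n] for i in range(n)]  # n roughly equal-sized folds

    scores = []
    for test_fold in folds:
        # Train on the other n-1 folds, score on the held-out fold
        train = [i for fold in folds if fold is not test_fold for i in fold]
        scores.append(train_and_score(
            [data[i] for i in train], [labels[i] for i in train],
            [data[i] for i in test_fold], [labels[i] for i in test_fold]))

    return sum(scores) / n  # average performance over the n runs

# Toy run: "train" a majority-class baseline and report its test-fold accuracy
def baseline(train_x, train_y, test_x, test_y):
    majority = max(set(train_y), key=train_y.count)
    return sum(y == majority for y in test_y) / len(test_y)

data = list(range(50))
labels = ["pos"] * 30 + ["neg"] * 20
print(cross_validate(data, labels, baseline, n=10))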