C1 Flashcards

1
Q

3 types of text mining tasks

A
  1. text classification/clustering: assign a category or cluster per document
  2. sequence labelling: assign a category per word in a text
  3. text-to-text generation: input is text, output is text
2
Q

4 challenges of text data

A
  1. text data is unstructured
  2. text data can be multi-lingual
  3. text data is noisy
  4. language is infinite
3
Q

bag of words model

A
  • the text (document) is the object to be classified
  • each word becomes a feature
  • each term in collection becomes a dimension in the vector space
  • only a few of all words occur in a given document => high dimensional, sparse vectors
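A minimal sketch of the bag of words idea, assuming whitespace tokenization and a two-document toy collection (both are illustrative choices, not part of the card). Each distinct term in the collection becomes one dimension, and each document becomes a vector of term counts.

```python
from collections import Counter

# Toy collection (assumption: invented example sentences).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary: one dimension per distinct term in the collection.
vocab = sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc, vocab):
    """Return the term-count vector of `doc` over the dimensions in `vocab`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Even in this tiny example some entries are zero; with a realistic vocabulary of tens of thousands of terms, almost all entries of a document vector are zero, which is what makes the vectors high dimensional and sparse.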
4
Q

word embeddings

A
  • lower dimensional and dense vector space
  • dimensions are learnt from data (not individually interpretable)
  • similar words are close to each other in the space
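A small sketch of "similar words are close to each other", assuming cosine similarity as the closeness measure. The three-dimensional vectors below are invented by hand for illustration; real embeddings are learnt from data and usually have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumption: values are made up, not learnt).
toy_embeddings = {
    "cat": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.4],
    "car": [0.1, 0.9, 0.2],
}

sim_cat_dog = cosine(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat~dog: {sim_cat_dog:.2f}, cat~car: {sim_cat_car:.2f}")
```

In this toy space "cat" ends up closer to "dog" than to "car", which is the behaviour a learnt embedding space is expected to show for related words.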
5
Q

evaluation metrics

A
  • precision: proportion of the assigned labels that are correct
  • recall: proportion of the relevant labels that were assigned
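The two definitions above can be sketched as follows, assuming labels are represented as Python sets. The function name `precision_recall` and the example labels are illustrative, not from the card.

```python
def precision_recall(assigned, relevant):
    """Compute precision and recall from assigned and relevant label sets."""
    assigned, relevant = set(assigned), set(relevant)
    correct = len(assigned & relevant)  # labels that are both assigned and relevant
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 4 labels assigned, 3 labels relevant, 2 in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.2f}, recall={r:.2f}")
```

Here 2 of the 4 assigned labels are correct (precision 0.5) and 2 of the 3 relevant labels were assigned (recall about 0.67). Returning 0.0 for an empty set is a convention chosen for this sketch.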
6
Q

precision versus recall when identifying terrorists

A

precision: of the people flagged as terrorists, how many really are terrorists (low precision = many innocent people wrongly flagged)

recall: of the actual terrorists, how many were flagged (low recall = many terrorists missed)

7
Q

text mining

A

automatic extraction of knowledge from text

8
Q

text mining pipeline for discovering side effects for hypertension medications

A
  1. Filter the data (retrieve relevant messages)
  2. Process the data (clean, anonymize)
  3. Create training data (human labelling)
  4. Identify medication names (named entity recognition)
  5. Identify side effects (named entity recognition)
  6. External knowledge needed (ontology)
  7. Relations between medications and side effects (relation extraction)
9
Q

Zipf’s law

A

Given a text collection, the frequency of any word is inversely proportional to its rank in the frequency table
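A rough sketch of how the law can be checked, assuming whitespace tokens and a tiny invented token list. If frequency is inversely proportional to rank, the product rank × frequency should stay roughly constant down the frequency table.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

# Toy data (assumption: invented; far too small to demonstrate the law itself).
tokens = "a a a a b b c".split()
table = rank_frequency(tokens)

for rank, word, freq in table:
    print(rank, word, freq, rank * freq)
```

On a real text collection the same table can be built from all tokens; the rank × frequency column is then only approximately constant, since Zipf's law is an empirical regularity rather than an exact rule.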

10
Q

extrinsic evaluation

A

evaluation of complete application
- human vs. automatic
- are humans helped/satisfied by the results?

11
Q

intrinsic evaluation

A

evaluation of the components: ground truth labels needed
- existing labels in the data
- human-assigned labels in the data
