C2 Flashcards

1
Q

markup

A

meta information in a text file that is clearly distinguishable from the textual content
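A minimal sketch of separating markup from content, assuming HTML-style angle-bracket tags. The sample string and the `strip_tags` helper are hypothetical, and the regex is a naive illustration rather than a real HTML parser.

```python
import re

# Hypothetical sample: the tags are markup, the words between them are content.
sample = "<p>Hello <b>world</b>!</p>"


def strip_tags(text: str) -> str:
    """Drop anything of the form <...>; naive, assumes well-formed simple tags."""
    return re.sub(r"<[^>]+>", "", text)


print(strip_tags(sample))  # Hello world!
```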

2
Q

unicode

A

universal character standard covering all writing systems, far more inclusive than ASCII

for maximum compatibility we encode texts in UTF-8 when reading and writing
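A small sketch of reading and writing with an explicit UTF-8 encoding in Python. The file name and sample text are assumptions chosen for illustration.

```python
import os
import tempfile

# Sample text with characters outside ASCII.
text = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e"

# Hypothetical file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Always name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)  # True
```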

3
Q

minimum edit distance between two strings

A

minimum number of editing operations (insertion, substitution, deletion) needed to transform one string into another

Levenshtein distance: deletion, insertion and substitution all have a cost of 1

4
Q

token count

A

number of words in a document, including duplicates

5
Q

vocabulary size

A

number of unique terms, feature size when we use words as features
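A quick sketch contrasting token count with vocabulary size. The example sentence is made up, and whitespace splitting is an assumed stand-in for a real tokenizer.

```python
doc = "the cat sat on the mat and the dog sat too"

# Naive whitespace tokenization (an assumption for this sketch).
tokens = doc.split()

token_count = len(tokens)      # every occurrence counts, duplicates included
vocabulary = set(tokens)       # unique terms only
vocabulary_size = len(vocabulary)

print(token_count)      # 11
print(vocabulary_size)  # 8
```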

6
Q

stop words

A

extremely common words without much content

  • remove stop words: keyword extraction
  • never remove stop words: sequence labelling tasks or classification tasks with small data
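A minimal sketch of stop word removal for keyword extraction. The `STOP_WORDS` set is a tiny hypothetical list, not a standard one, and `remove_stop_words` is an illustrative helper.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "on", "to"}


def remove_stop_words(tokens):
    """Keep only tokens whose lowercased form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "The history of the Unicode standard is long".split()
keywords = remove_stop_words(tokens)
print(keywords)  # ['history', 'Unicode', 'standard', 'long']
```

For sequence labelling the original token sequence would be kept intact, since removing words shifts the positions the labels refer to.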
7
Q

basic word forms

A

reducing words to a basic form reduces the number of features and generalizes better

lemma: dictionary form of a word (verbs: infinitive, nouns: singular form)

stem: portion of a word that is common to a set of (inflected) forms when all affixes are removed (not further analyzable into meaningful elements)
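A toy sketch of the difference between a lemma and a stem. The `LEMMAS` table and `SUFFIXES` list are hypothetical and far too small for real use; actual lemmatizers rely on a dictionary, and actual stemmers use more careful rules.

```python
# Hypothetical lookup table standing in for a real dictionary.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "studies": "study"}

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ies", "es", "s", "ed")


def lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["studies", "running", "mice"]:
    print(w, lemmatize(w), naive_stem(w))
# studies study stud
# running run runn
# mice mouse mice
```

Note that the stems ("stud", "runn") need not be dictionary words, while the lemmas always are.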

8
Q

character encoding

A

the mapping between characters and the numeric codes (bytes) a computer stores, which lets it display text in a way that humans can understand
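A short sketch of how one character maps to different bytes under different encodings, and what happens when bytes are decoded with the wrong one. The choice of "é" and Latin-1 is just for illustration.

```python
char = "\u00e9"  # the letter e with acute accent

print(ord(char))               # 233 (code point U+00E9)
print(char.encode("utf-8"))    # b'\xc3\xa9' (two bytes)
print(char.encode("latin-1"))  # b'\xe9' (one byte)

# Decoding UTF-8 bytes with the wrong encoding produces garbled text.
garbled = char.encode("utf-8").decode("latin-1")
print(garbled)  # Ã©
```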

9
Q

Levenshtein distance at (i, j)

A

minimum of:
D(i-1, j) + 1
D(i, j-1) + 1
D(i-1, j-1) + 1 if X(i) ≠ Y(j)
D(i-1, j-1) + 0 if X(i) = Y(j)
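The recurrence can be sketched as a dynamic-programming table. The base cases D(i, 0) = i and D(0, j) = j are not stated on the card and are assumed here as the usual initialization; the function name `levenshtein` is illustrative.

```python
def levenshtein(x: str, y: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    n, m = len(x), len(y)
    # D[i][j] = distance between the first i chars of x and first j chars of y.
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed base cases: turning a prefix into the empty string (or back).
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Strings are 0-indexed in Python, so X(i) is x[i - 1].
            sub_cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,             # deletion
                D[i][j - 1] + 1,             # insertion
                D[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return D[n][m]


print(levenshtein("kitten", "sitting"))  # 3
```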

10
Q

token

A

an instance of a word or term occurring in a document

11
Q

term

A

a token when used as feature (or in an index), generally in normalized form (e.g. lowercased)
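A brief sketch of normalizing tokens into terms. Lowercasing follows the card; stripping surrounding punctuation is an extra assumption, and `to_term` is a hypothetical helper.

```python
import string


def to_term(token: str) -> str:
    """Lowercase and strip surrounding punctuation (the latter is an assumption)."""
    return token.lower().strip(string.punctuation)


tokens = ["The", "the", "THE", "Cat,", "cat"]
terms = [to_term(t) for t in tokens]

print(terms)               # ['the', 'the', 'the', 'cat', 'cat']
print(sorted(set(terms)))  # ['cat', 'the']
```

Five distinct-looking tokens collapse to two terms, which is what shrinks the feature space or index.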

12
Q

Optical Character Recognition

A

a technique for converting the image of a printed text to a digital text
