Topic 1: Regular Expression Flashcards
What are the key components of RE?
search, string, pattern, corpus
What is a regular expression?
A language for specifying text search.
An expression used to specify a set of strings required for a particular purpose.
What is a string?
A sequence of symbols.
In text-based search, a string is a sequence of alphanumeric characters.
What is a pattern?
A specific sequence of characters/symbols; useful in RE for text searching.
A regular expression search requires two things. What are they?
A pattern to search for, and a corpus (the text to search through).
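A minimal sketch using Python's re module; the pattern and corpus strings are made up for illustration:

```python
import re

pattern = r"woodchucks?"                               # the pattern to search for
corpus = "How much wood would a woodchuck chuck?"      # the text to search through

match = re.search(pattern, corpus)
if match:
    print(match.group())                               # -> "woodchuck"
```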
What are the applications of regular expressions?
- testing for a pattern within a string
- selecting data in databases
- substitution
What are the basic patterns in RE?
- case sensitivity / character disjunction, e.g. [wW]oodchuck
- negation, e.g. [^A]
- range, e.g. [a-z]
- RE symbols: ? * +
- RE: disjunction (|), precedence (parentheses)
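Rough Python illustrations of these basic patterns (the sample strings are arbitrary examples):

```python
import re

print(re.findall(r"[wW]oodchuck", "Woodchucks chew wood; a woodchuck could chew"))  # character disjunction
print(re.findall(r"[^aeiou ]", "groundhog"))       # negation: anything NOT in the set
print(re.findall(r"[a-d]", "abcdefg"))             # range: a through d
print(re.findall(r"colou?r", "color colour"))      # ? : zero or one of the previous char
print(re.findall(r"o*h!", "oh! ooh! h!"))          # * : zero or more
print(re.findall(r"o+h!", "oh! ooh! h!"))          # + : one or more
print(re.findall(r"cat|dog", "cat dog cow"))       # | : disjunction of expressions; () sets precedence
```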
Types of errors and their definitions?
- false positive: matching strings that should not have been matched
- false negative: not matching strings that should have been matched
What are the efforts to reduce the error rate?
- increasing accuracy/precision (minimizing false positives)
- increasing coverage/recall (minimizing false negatives)
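A common textbook illustration, sketched in Python: searching for the word "the" shows both error types and how fixing them trades off precision against recall (the sentence is made up):

```python
import re

corpus = "The other one saw the theology student"

# False negative: misses "The"; false positives: matches inside "other" and "theology".
print(re.findall(r"the", corpus))

# Increase coverage/recall: also allow a capital T.
print(re.findall(r"[tT]he", corpus))

# Increase accuracy/precision: require word boundaries.
print(re.findall(r"\b[tT]he\b", corpus))   # -> ['The', 'the']
```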
What is a capture group?
The use of parentheses to store a matched pattern in memory (a register) so it can be referred to again, e.g. with \1.
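A small Python sketch of a capture group and back-reference; the example strings are illustrative:

```python
import re

pattern = r"the (\w+)er they were, the \1er they will be"

m = re.search(pattern, "the bigger they were, the bigger they will be")
print(m.group(1))    # -> "bigg", the text stored by the capture group

# The back-reference \1 must repeat the stored text, so this does not match:
print(re.search(pattern, "the bigger they were, the faster they will be"))   # -> None
```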
What is a corpus?
A computer-readable collection of text or speech.
Brown corpus?
A million-word collection of samples from 500 written English texts, assembled at Brown University in 1963-64.
Brown sentence?
What is an utterance?
A unit of speech bounded by silence.
What are the components of disfluencies?
fragments, filled pauses
Give examples of fragments and filled pauses.
- fragments: broken-off words, e.g. "main-" in "main- mainly"
- filled pauses: uh, um
Definition of word types and tokens?
Word types are the number of distinct words in a corpus; tokens are the number of running words.
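A tiny Python illustration of the distinction; the toy sentence is arbitrary:

```python
corpus = "they picnicked by the pool then lay back on the grass"

tokens = corpus.split()     # running words
types = set(tokens)         # distinct words

print(len(tokens))          # 11 tokens
print(len(types))           # 10 types ("the" occurs twice)
```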
Why does code switching need to be handled?
Speakers often use multiple languages within a single communicative act (e.g. a tweet that mixes Spanish and English).
List three tasks commonly applied as part of any normalization process.
- segmenting (tokenizing) words from running text
- normalizing word formats
- segmenting sentences in running text
Example of tokenization in UNIX?
- tokenization (tr -sc 'A-Za-z' '\n')
- sorting (sort)
- merging upper and lower case (tr A-Z a-z)
- counting and sorting the counts (uniq -c | sort -n -r)
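A Python analogue of that pipeline, mirroring the same steps; the filename corpus.txt is an assumption:

```python
import re
from collections import Counter

with open("corpus.txt") as f:                 # assumed input file
    text = f.read()

tokens = re.findall(r"[A-Za-z]+", text)       # tokenization: runs of letters
tokens = [t.lower() for t in tokens]          # merge upper and lower case
counts = Counter(tokens)                      # count each word type

for word, freq in counts.most_common(10):     # sorted by descending count
    print(freq, word)
```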
What are tokenization and normalization?
- tokenization: the process of segmenting text into words
- normalization: the process of putting words into a standard format
What are the issues in tokenization?
The handling of symbols and punctuation, e.g. apostrophes in clitics (doesn't, we're), hyphens, and punctuation inside words (m.p.h., AT&T, $45.55).
Goals of a tokenizer?
- expand clitic contractions (we're -> we are)
- tokenize multiword expressions (New York)
- normalize tokens
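A rough sketch of these goals in Python; the contraction map and multiword-expression list are toy assumptions, not a real tokenizer:

```python
import re

CONTRACTIONS = {"we're": "we are", "doesn't": "does not"}   # assumed toy map
MWES = ["New York"]                                         # assumed toy MWE list

def tokenize(text):
    for mwe in MWES:                                  # keep MWEs together as one token
        text = text.replace(mwe, mwe.replace(" ", "_"))
    for short, full in CONTRACTIONS.items():          # expand clitic contractions
        text = re.sub(short, full, text, flags=re.IGNORECASE)
    return [tok.lower() for tok in re.findall(r"[\w_']+", text)]   # normalized tokens

print(tokenize("We're flying to New York"))
# -> ['we', 'are', 'flying', 'to', 'new_york']
```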
Case folding?
Reducing all letters to lower case. However, for tasks like sentiment analysis case carries meaning: US vs. us is an important distinction.
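A one-line sketch of case folding and the information it discards (toy token list):

```python
tokens = ["The", "US", "asked", "us", "to", "help"]

# Folding collapses "US" (the country) and "us" (the pronoun) into one type.
print([t.lower() for t in tokens])    # ['the', 'us', 'asked', 'us', 'to', 'help']
```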
Word segmentation with the MaxMatch algorithm?
Some languages (Chinese, Thai, Japanese) don't use spaces to mark word boundaries.
The standard baseline algorithm is MaxMatch (maximum matching).
Procedure of the MaxMatch algorithm:
1. Start a pointer at the beginning of the string.
2. Find the longest word in the dictionary that matches the string starting at the pointer.
3. Move the pointer past that word.
4. Repeat from step 2.
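A minimal Python sketch of MaxMatch; the toy dictionary is an assumption, and a real implementation would need a full word list:

```python
def max_match(text, dictionary):
    """Greedy left-to-right longest-match word segmentation."""
    words = []
    i = 0                                      # pointer at the start of the string
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):      # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]              # longest dictionary word starting at i
                break
        if match is None:
            match = text[i]                    # no word found: emit a single character
        words.append(match)
        i += len(match)                        # move the pointer over the word
    return words

dictionary = {"we", "can", "only", "see", "a", "short", "distance", "ahead"}
print(max_match("wecanonlyseeashortdistanceahead", dictionary))
# -> ['we', 'can', 'only', 'see', 'a', 'short', 'distance', 'ahead']
```

Note that greedy longest-match can still mis-segment (e.g. if "canon" were in the dictionary it would be chosen over "can" and derail the rest of the sentence), which is one reason MaxMatch works much better on Chinese than on English.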