ML4QS Chapter 4 Flashcards

feature engineering time domain:

> what is lambda?

> what does lambda = 1 mean?

lambda is the window size: it expresses the number of discrete time steps considered

> lambda = 1: consider two instances, the current instance and the one before it
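A minimal pandas sketch of what lambda means in practice (the column name and values are made up): a rolling window of lambda + 1 instances feeds a time-domain feature such as the mean.

```python
import pandas as pd

# Hypothetical sensor column; name and values are assumptions.
df = pd.DataFrame({'acc_x': [0.1, 0.4, 0.3, 0.8, 0.5]})

lam = 1  # lambda = 1 -> each window spans lambda + 1 = 2 instances
df['acc_x_temp_mean'] = df['acc_x'].rolling(window=lam + 1).mean()

print(df)  # the first lam rows are NaN: not enough history yet
```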



time domain: categorical data

> what are two types of temporal patterns?

1. succession - one value occurs before the other

2. co-occurrence - both values occur at the same time point


time domain: categorical features

> what is meant by the "support" of the temporal patterns in our data

> how to compute it?

support: the number of time points at which the pattern occurs, relative to the total number of time points in our data

> for all instances, check whether the pattern occurs within the selected window size
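A rough sketch of computing support for a succession pattern over a categorical series; the function name, the example labels, and the exact counting convention are assumptions, not the book's code.

```python
def pattern_support(values, pattern, lam):
    """Support of a succession pattern (a, b): b occurs at time t and
    a occurs somewhere within the preceding lam steps."""
    a, b = pattern
    hits = 0
    for t in range(lam, len(values)):  # skip the first lam instances
        if values[t] == b and a in values[t - lam:t]:
            hits += 1
    return hits / len(values)

labels = ['sit', 'sit', 'walk', 'run', 'walk', 'run']
print(pattern_support(labels, ('walk', 'run'), lam=1))  # 2/6 ~ 0.33
```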


time domain: categorical

> how to generate valid patterns?

> why is this useful/valid?

1. define a minimal support threshold theta

2. generate all possible patterns of size 1 that meet theta

3. iteratively extend the surviving patterns by 1 until the desired size k

> this is much more efficient than simply checking all possible combinations

> it is valid because the support of a new k-pattern can never be greater than the support of the least supported subpattern it includes
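A hedged sketch of this Apriori-style growth; `support_fn`, `theta`, and `k_max` are placeholders for whatever support computation and thresholds are actually used.

```python
def generate_patterns(items, support_fn, theta, k_max):
    """Grow patterns one item at a time, keeping only those that meet
    theta. support_fn is assumed to compute the support of a
    tuple-encoded pattern over the data."""
    frequent = [(it,) for it in items if support_fn((it,)) >= theta]
    all_frequent = list(frequent)
    for _ in range(k_max - 1):
        # A k-pattern can never beat its least supported subpattern,
        # so extending only the survivors is safe (the pruning step).
        candidates = [p + (it,) for p in frequent for it in items]
        frequent = [p for p in candidates if support_fn(p) >= theta]
        all_frequent.extend(frequent)
    return all_frequent
```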


calculating support: what to do with the first lambda instances?

ignore the first lambda instances, as they do not have sufficient history available


feature engineering: time domain

> how to handle mixed data?

derive categorical features from numerical features; two methods:

1. if meaningful ranges are known, map values to them (low, normal, high)

2. if no such information is available, calculate the slope over the window

> if the absolute slope exceeds a certain threshold: increasing/decreasing

> else: stable
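A small sketch of the slope method, assuming a least-squares line fit over the window; the threshold value is purely illustrative.

```python
import numpy as np

def categorize_trend(window_values, threshold=0.1):
    """Fit a line through the window and label the trend."""
    x = np.arange(len(window_values))
    slope = np.polyfit(x, window_values, 1)[0]  # slope of fitted line
    if slope > threshold:
        return 'increasing'
    if slope < -threshold:
        return 'decreasing'
    return 'stable'

print(categorize_trend([1.0, 1.2, 1.5, 1.9]))  # 'increasing'
```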


FT: what is the base frequency?

base frequency:

f0 = 2*pi / (lambda + 1)

> lambda + 1 is the number of data points we consider

> 2*pi is one full sinusoid period

> base frequency is the lowest frequency that can fit a whole period into our window
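A quick numeric check of this, assuming a window of 10 data points; numpy's rfftfreq exposes the same frequency grid in cycles per sample.

```python
import numpy as np

lam = 9                    # window of lambda + 1 = 10 data points
n = lam + 1
f0 = 2 * np.pi / n         # base frequency in radians per sample

print(f0)                  # ~0.628: one full period fits the window
print(np.fft.rfftfreq(n))  # same grid in cycles/sample: multiples of 1/n
```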



FT: why do we need lambda +1 frequencies to represent our original sequence?

0 * f0

1 * f0

...

lambda * f0

>>> counting from 0 * f0 up to lambda * f0, starting at zero, gives lambda + 1 frequencies in total


FT: which kinds of features can we derive from FT?

frequency domain features:

1. amplitude: the frequency with the highest amplitude is the most important frequency in the considered window

2. frequency weighted signal average: the amplitude-weighted average frequency within the considered window

3. power spectral entropy: describes how much information is contained within the signal

> i.e. whether one or a few discrete frequencies stand out from all the others
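One possible way to compute these three features with numpy; a sketch assuming the window is a plain 1-D array, not the book's exact implementation.

```python
import numpy as np

def frequency_features(window):
    """Sketch of the three frequency-domain features for one window."""
    amps = np.abs(np.fft.rfft(window))     # amplitude per frequency
    freqs = np.fft.rfftfreq(len(window))   # matching frequency grid

    # 1. frequency with the highest amplitude (the DC bin at index 0
    #    often dominates; excluding it is a common refinement)
    max_freq = freqs[np.argmax(amps)]

    # 2. frequency weighted signal average
    weighted_avg = np.sum(freqs * amps) / np.sum(amps)

    # 3. power spectral entropy: low if one frequency stands out
    p = amps**2 / np.sum(amps**2)
    entropy = -np.sum(p * np.log(p + 1e-12))

    return max_freq, weighted_avg, entropy
```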


unstructured data: preprocessing steps

> 4 steps in order to extract attributes from words

1. tokenization

> identify sentences and words within sentences

2. lower case

> change uppercase to lowercase

3. stemming

> reduce each word to its stem

> this maps all different variations of a word to a single term

4. stop word removal

> remove known stop words as they are not likely to be predictive
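The four steps sketched with NLTK (assuming the punkt and stopwords resources are downloaded); here the stop word filter runs on the unstemmed tokens so the stop word list still matches.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# one-time setup: nltk.download('punkt'); nltk.download('stopwords')
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    tokens = nltk.word_tokenize(text)         # 1. tokenization
    tokens = [t.lower() for t in tokens]      # 2. lower case
    # 4. stop word removal (done before stemming here, since e.g.
    #    stemming 'was' -> 'wa' would hide it from the stop word list)
    tokens = [t for t in tokens if t not in stop_words]
    return [stemmer.stem(t) for t in tokens]  # 3. stemming

print(preprocess("I was running and I ran yesterday"))
# e.g. ['run', 'ran', 'yesterday']
```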


explain bag of words

bag of words:

1. define n-grams of words (unigrams, bigrams, etc.)

2. count the number of occurrences of each n-gram in the text, irrespective of the order of appearance

3. the value of the new attribute is the number of occurrences for that text

> can be binary, with only true (occurs in the text) and false (does not occur in the text)
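A bag-of-words sketch using scikit-learn's CountVectorizer; the example documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the sensor logged a fall", "no fall was logged today"]

# unigrams and bigrams; pass binary=True for true/false attributes
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # one attribute per n-gram
print(counts.toarray())                    # occurrence counts per text
```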


explain TF-IDF

(term frequency inverse document frequency)

1. do bag of words >>> term frequency in document = a

2. normalize: divide the total number of instances by the number of instances that contain the n-gram = idf

> the higher the number, the more unique the n-gram is

3. compute a*idf = tf_idf

> n-grams that are unique are weighted more

> this prevents very frequent words from becoming dominant in our attributes
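A literal sketch of the computation as described on this card (the example documents are made up); note that most libraries multiply by log(N/df) rather than the raw ratio.

```python
docs = [["sensor", "fall", "fall"],
        ["sensor", "walk"],
        ["walk", "run"]]
N = len(docs)  # total number of instances

def tf_idf(term, doc):
    a = doc.count(term)                     # 1. bag-of-words count
    df = sum(1 for d in docs if term in d)  # instances containing the n-gram
    idf = N / df                            # 2. the higher, the more unique
    return a * idf                          # 3. (libraries usually use a * log(N/df))

print(tf_idf("fall", docs[0]))  # 2 * (3 / 1) = 6.0
```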


explain topic modeling

topic modeling: extract more high level topics from text

1. assume W words (generated by a Poisson distribution) and a distribution over topics

2. for each word in W select a topic based on the probabilities

3. for each word, assume that its current topic assignment is wrong but that all other assignments are correct

4. probabilistically assign word w to a topic based on

> what topics are in document

> number of times word w assigned to particular topic

5. repeat

>>> create one attribute per topic and assign a value based on the observed frequencies of the words and the weights assigned to the words for the topic
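In practice one would rarely write this sampling loop by hand; a sketch with scikit-learn's LatentDirichletAllocation, where the documents and the number of topics are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["heart rate rose during the run",
        "slept badly, heart rate high at night",
        "long run, steady pace, good sleep after"]

counts = CountVectorizer(stop_words='english').fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_values = lda.fit_transform(counts)  # one row per document

print(topic_values)  # one attribute (column) per topic
```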


why are overlapping windows an issue?

> solution?

overlapping windows are of course highly correlated

> adjacent instances differ in just one time point

> this is likely to cause overfitting

solution: set a maximum overlap for windows and remove instances for which this criterion is not met

(typically 50% overlap is allowed)
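A sketch of enforcing this by keeping only every step-th windowed instance; the 50% default follows the card, everything else is an assumption.

```python
import math

def kept_indices(n_instances, lam, max_overlap=0.5):
    """Keep only windowed instances whose windows overlap by at most
    max_overlap; the first lam instances are skipped (no history)."""
    window = lam + 1
    step = max(1, math.ceil(window * (1 - max_overlap)))
    return list(range(lam, n_instances, step))

print(kept_indices(20, lam=9))  # [9, 14, 19]: adjacent 10-point windows share 5 points
```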