feature engineering: time domain
> what is lambda?
> what does lambda = 1 mean?
lambda is the window size: it expresses the number of discrete time steps considered
> lambda = 1: consider two instances, the current instance and the instance before it
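> a minimal Python sketch (hypothetical values): with lambda = 1, each feature is computed over the current instance and the one before it
```python
import numpy as np

values = np.array([3.0, 4.0, 6.0, 5.0, 7.0])  # example numerical series
lam = 1
window = lam + 1  # lambda + 1 = 2 instances per window

# mean over the current instance and the lam preceding ones
windowed_mean = [values[t - lam:t + 1].mean() for t in range(lam, len(values))]
print(windowed_mean)  # [3.5, 5.0, 5.5, 6.0]
```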
time domain: categorical data
> what are two types of temporal patterns?
1. succession - one value occurs before the other
2. co-occurrence - values occur at the same time point
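> a sketch of checking both pattern types within a window (hypothetical attributes "activity" and "location"):
```python
activity = ["sit", "walk", "walk", "run", "sit"]
location = ["home", "home", "work", "work", "home"]
lam = 2

def co_occurrence(t, a_val, b_val):
    # pattern occurs if both values appear at the same time point inside the window
    return any(activity[i] == a_val and location[i] == b_val
               for i in range(t - lam, t + 1))

def succession(t, a_val, b_val):
    # pattern occurs if a_val appears strictly before b_val inside the window
    return any(activity[i] == a_val and location[j] == b_val
               for i in range(t - lam, t + 1)
               for j in range(i + 1, t + 1))

print(co_occurrence(4, "walk", "work"))  # True: both at t = 2
print(succession(4, "run", "home"))      # True: run at t = 3, home at t = 4
```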
time domain: categorical features
> what is meant by the "support" of the temporal patterns in our data
> how to compute it?
support: the fraction of time points in our data at which the pattern occurs
> for all instances, check whether the pattern occurs within the selected window
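> a sketch of the support computation for one example pattern (hypothetical data):
```python
activity = ["sit", "walk", "walk", "run", "sit", "walk", "run", "sit"]
lam = 1

def pattern_in_window(t):
    # example pattern: "walk" followed by "run" within the window {t-lam, ..., t}
    window = activity[t - lam:t + 1]
    return any(window[i] == "walk" and "run" in window[i + 1:]
               for i in range(len(window)))

# only time points with a full history are considered
valid_points = range(lam, len(activity))
support = sum(pattern_in_window(t) for t in valid_points) / len(valid_points)
print(support)  # 2/7: the pattern occurs in the windows ending at t = 3 and t = 6
```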
time domain: categorical
> how to generate valid patterns?
> why is this useful/valid?
1. define minimal support threshold theta
2. generate all possible patterns of size 1 that meet theta
3. iteratively extend possible patterns by 1 until desired size k
> this is much more efficient than simply checking all possible combinations
> valid because the support of a new k-pattern can never be greater than the support of the least supported subpattern it includes, so extending only frequent patterns cannot miss any pattern that meets theta
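> a sketch of the generation loop; compute_support is an assumed helper that returns the support of a candidate pattern:
```python
def generate_patterns(basic_patterns, theta, k, compute_support):
    # step 2: all size-1 patterns that meet the minimal support threshold theta
    frequent = [[p] for p in basic_patterns if compute_support([p]) >= theta]
    all_frequent = list(frequent)
    # step 3: iteratively extend frequent patterns by one element up to size k
    for _ in range(k - 1):
        candidates = [pat + [p] for pat in frequent for p in basic_patterns]
        # pruning: a k-pattern can never have higher support than its subpatterns,
        # so only extensions of already-frequent patterns are considered
        frequent = [c for c in candidates if compute_support(c) >= theta]
        all_frequent.extend(frequent)
    return all_frequent
```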
calculating support: what to do with the first lambda instances?
ignore the first lambda instances, as they do not have sufficient history available
feature engineering: time domain
> how to handle mixed data?
derive categorical features from numerical features: two methods:
1. if meaningful ranges are known, map values to categories (e.g. low, normal, high)
2. if no such information is available, calculate the slope over the window
> if the slope is above/below a certain threshold: increasing/decreasing
> else: stable
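> a sketch of the slope-based approach (hypothetical threshold of 0.1), using a least-squares fit over the window:
```python
import numpy as np

def slope_category(window_values, threshold=0.1):
    x = np.arange(len(window_values))
    slope = np.polyfit(x, window_values, 1)[0]  # first-degree fit, slope coefficient
    if slope > threshold:
        return "increasing"
    if slope < -threshold:
        return "decreasing"
    return "stable"

print(slope_category([1.0, 1.2, 1.5, 1.9]))    # increasing
print(slope_category([2.0, 2.01, 1.99, 2.0]))  # stable
```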
FT: what is the base frequency?
base frequency:
f0 = 2*pi / (lambda + 1)
> lambda + 1 is the number of data points we consider
> 2*pi is one full sinusoid period
> base frequency is the lowest frequency that can fit a whole period into our window
FT: why do we need lambda +1 frequencies to represent our original sequence?
0 * f0
1 * f0
...
lambda * f0
>>> since the count starts at zero, this gives lambda + 1 frequencies in total
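> a small sketch of the base frequency and the resulting frequency list (example lambda = 9):
```python
import numpy as np

lam = 9                      # window covers lambda + 1 = 10 data points
f0 = 2 * np.pi / (lam + 1)   # lowest frequency fitting one full period in the window
frequencies = [k * f0 for k in range(lam + 1)]  # k = 0, 1, ..., lambda
print(len(frequencies))      # 10 frequencies, because the count starts at k = 0
```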
FT: which kinds of features can we derive from FT?
frequency domain features:
1. amplitude: the frequency with the highest amplitude is the most important frequency in the considered window
2. frequency weighted signal average: the average of the frequencies weighted by their amplitudes within the considered window
3. power spectral entropy: describes how much information is contained in the signal
> i.e. whether one or a few discrete frequencies stand out from all the others
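> a sketch computing all three features from the FFT of one window (hypothetical example signal):
```python
import numpy as np

window = np.sin(2 * np.pi * np.arange(40) / 8)  # example signal, dominant period of 8 samples
amplitudes = np.abs(np.fft.rfft(window))
freqs = np.fft.rfftfreq(len(window))

# 1. frequency with the highest amplitude
dominant_freq = freqs[np.argmax(amplitudes)]

# 2. frequency weighted signal average
weighted_avg_freq = np.sum(freqs * amplitudes) / np.sum(amplitudes)

# 3. power spectral entropy
psd = amplitudes ** 2
psd_norm = psd / np.sum(psd)
spectral_entropy = -np.sum(psd_norm * np.log(psd_norm + 1e-12))

print(dominant_freq, weighted_avg_freq, spectral_entropy)
```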
unstructured data: preprocessing steps
> 4 steps in order to extract attributes from words
1. tokenization
> identify sentences and words within sentences
2. lower case
> change uppercase to lowercase
3. stemming
> reduce each word to its stem
> map all inflected variations to a single term
4. stop word removal
> remove known stop words as they are not likely to be predictive
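> a sketch of the four steps using NLTK (assumes the required resources, e.g. 'punkt' and 'stopwords', have been downloaded via nltk.download()):
```python
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

text = "The runner was running quickly. Runners often run daily."

# 1. tokenization: sentences, then words within sentences
tokens = [w for s in sent_tokenize(text) for w in word_tokenize(s)]
# 2. lower case
tokens = [w.lower() for w in tokens]
# 3. stemming: map variations (run, running, runner) to a single stem
stemmer = PorterStemmer()
tokens = [stemmer.stem(w) for w in tokens]
# 4. stop word removal (and dropping punctuation tokens)
stop_words = set(stopwords.words("english"))
tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
print(tokens)
```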
explain bag of words
bag of words:
1. define n-grams of words (unigrams, bigrams, etc.)
2. count the number of occurrences of each n-gram in the text, irrespective of the order of appearance
3. the value of the new attribute is the number of occurrences in that text
> can also be binary: true (occurs in the text) or false (does not occur in the text)
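> a sketch of bag of words over unigrams and bigrams (hypothetical toy document):
```python
from collections import Counter

def bag_of_words(tokens, n=1):
    # build the n-grams and count them, irrespective of order of appearance
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(ngrams)

doc = ["run", "fast", "run", "slow"]
print(bag_of_words(doc, n=1))  # {'run': 2, 'fast': 1, 'slow': 1}
print(bag_of_words(doc, n=2))  # {'run fast': 1, 'fast run': 1, 'run slow': 1}
```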
explain TF-IDF
(term frequency inverse document frequency)
1. do bag of words >>> term frequency in the document = tf
2. normalize: divide the total number of instances by the number of instances that contain the n-gram = idf
> the higher the number, the more unique the n-gram is
3. compute tf * idf = tf_idf
> n-grams that are unique are weighted more
> this avoids very frequent words becoming dominant in our attributes
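> a sketch implementing TF-IDF exactly as described above (hypothetical toy corpus); note that most implementations take the logarithm of the idf ratio:
```python
docs = [["run", "fast", "run"],
        ["walk", "fast"],
        ["run", "walk", "walk"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                            # term frequency in this document
    containing = sum(1 for d in docs if term in d)  # documents containing the term
    idf = len(docs) / containing                    # higher = more unique n-gram
    return tf * idf

print(tf_idf("run", docs[0], docs))   # 2 * (3/2) = 3.0
print(tf_idf("fast", docs[0], docs))  # 1 * (3/2) = 1.5
```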
explain topic modeling
topic modeling: extract higher-level topics from text
1. assume W words (number generated by a Poisson distribution) and a distribution over topics
2. for each word in W, select a topic based on the topic probabilities
3. for each word, assume its current topic assignment is wrong but all other assignments are correct
4. probabilistically reassign word w to a topic based on
> which topics are present in the document
> the number of times word w is assigned to a particular topic
5. repeat until the assignments stabilize
>>> create one attribute per topic and assign a value based on the observed frequencies of words and the weights assigned to the words for the topic
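> a sketch using scikit-learn's LDA implementation (hypothetical toy corpus); the per-document topic distribution gives one attribute per topic:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["heart rate high during run",
        "slept badly woke up tired",
        "long run high heart rate",
        "tired after poor sleep"]

counts = CountVectorizer().fit_transform(docs)   # bag of words representation
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_values = lda.fit_transform(counts)         # one column (attribute) per topic
print(topic_values.shape)                        # (4, 2)
```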
why are overlapping windows an issue?
> solution?
overlapping windows are of course highly correlated
> adjacent windows differ in just a single time point
> this is likely to cause overfitting
solution: set a maximum overlap for windows and remove instances for which this criterion is not met
(typically 50% overlap is allowed)
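> a sketch of selecting window end points so that consecutive windows overlap by at most 50% (hypothetical series length):
```python
lam = 9                       # window size lambda + 1 = 10
n = 100                       # number of time points
max_overlap = 0.5
step = max(1, int((lam + 1) * (1 - max_overlap)))  # shift by at least half a window

selected = list(range(lam, n, step))
print(selected[:5])           # [9, 14, 19, 24, 29]: each window shares 5 of 10 points
```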