Chapter 4 Flashcards
(14 cards)
feature engineering time domain:
> what is lambda?
> what does lambda = 1 mean?
lambda is the window size: it expresses the number of discrete time steps considered
> lambda = 1: consider two instances, the current instance and the instance before it
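a minimal sketch of using a window of size lambda to derive a numerical time-domain feature; the pandas rolling mean and the column name `acc_x` are illustrative assumptions, not the book's exact implementation:

```python
import pandas as pd

# hypothetical sensor data, one row per discrete time step
df = pd.DataFrame({"acc_x": [0.1, 0.4, 0.3, 0.9, 0.7, 0.2]})

lam = 1  # lambda: number of previous time steps to include

# window of size lambda + 1: the current instance plus the lambda before it
df["acc_x_mean"] = df["acc_x"].rolling(window=lam + 1).mean()

print(df)
# the first lambda rows are NaN: they do not have sufficient history
```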
time domain: categorical data
> what are two types of temporal patterns?
- succession - one pattern occurs before the other
- co-occurrence - both occur at the same time point
time domain: categorical features
> what is meant by the “support” of a temporal pattern in our data?
> how to compute it?
support: how often the pattern occurs in the data, relative to the number of time points in the data
> for each instance, check whether the pattern occurs within the window of size lambda ending at that instance; divide the count by the number of time points
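a sketch of computing support for a succession pattern; the pattern representation (a tuple of values that must appear in order within the window) and the helper names are illustrative assumptions:

```python
# categorical values observed at each time point (made-up example)
series = ["walk", "walk", "run", "walk", "run", "run", "walk", "run"]
lam = 1  # window size lambda

def pattern_occurs(window, pattern):
    # succession pattern: the values appear in this order within the window
    idx = 0
    for x in window:
        if x == pattern[idx]:
            idx += 1
            if idx == len(pattern):
                return True
    return False

def support(series, pattern, lam):
    # skip the first lambda instances: they lack sufficient history
    hits = sum(
        pattern_occurs(series[t - lam : t + 1], pattern)
        for t in range(lam, len(series))
    )
    return hits / len(series)  # relative to the number of time points

print(support(series, ("walk", "run"), lam))  # -> 0.375
```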
time domain: categorical
> how to generate valid patterns?
> why is this useful/valid?
- define minimal support threshold theta
- generate all possible patterns of size 1 that meet theta
- iteratively extend possible patterns by 1 until desired size k
> this is much more efficient than simply checking all possible combinations
> the support of a new k-pattern can never be greater than the support value of the least supported subpattern it includes
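a sketch of this Apriori-style generation, reusing the `support` function from the previous sketch; how patterns of size k are represented here is a simplifying assumption:

```python
def generate_patterns(series, lam, theta, k_max):
    """Generate all patterns up to size k_max whose support meets theta."""
    values = sorted(set(series))
    freq = {}

    # step 1: size-1 patterns that meet the minimum support threshold theta
    current = [(v,) for v in values if support(series, (v,), lam) >= theta]
    freq.update({p: support(series, p, lam) for p in current})

    # step 2: iteratively extend surviving patterns by one value;
    # a k-pattern can never have higher support than its least supported
    # sub-pattern, so extensions of pruned patterns are never considered
    for _ in range(2, k_max + 1):
        extended = []
        for pat in current:
            for v in values:
                new_pat = pat + (v,)
                sup = support(series, new_pat, lam)
                if sup >= theta:
                    extended.append(new_pat)
                    freq[new_pat] = sup
        current = extended
    return freq

print(generate_patterns(series, lam=1, theta=0.3, k_max=2))
```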
calculating support: what to do with the first lambda instances?
ignore the first lambda instances, as they do not have sufficient history available
feature engineering: time domain
> how to handle mixed data?
derive categorical features from numerical features: two methods:
- if meaningful value ranges are known, map the value to a category (e.g. low, normal, high)
- if no such information is available, calculate the slope over the window
> if slope above certain threshold: increasing/decreasing
> else stable
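a sketch of the slope-based approach; the least-squares fit via numpy and the threshold value are assumptions:

```python
import numpy as np

def trend_category(window_values, threshold=0.1):
    # fit a straight line through the values in the window and take its slope
    slope = np.polyfit(range(len(window_values)), window_values, 1)[0]
    if slope > threshold:
        return "increasing"
    if slope < -threshold:
        return "decreasing"
    return "stable"

print(trend_category([0.2, 0.5, 0.9]))   # increasing
print(trend_category([0.5, 0.5, 0.52]))  # stable
```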
FT: what is the base frequency?
base frequency:
f0 = 2*pi / (lambda + 1)
> lambda + 1 is the number of data points we consider
> 2*pi is one full sinusoid period
> base frequency is the lowest frequency that can fit a whole period into our window
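a small sketch with numpy's FFT showing that a window of lambda + 1 samples is described by multiples of the base frequency; the sample signal is made up:

```python
import numpy as np

lam = 9                      # lambda: window size
n = lam + 1                  # number of data points in the window
f0 = 2 * np.pi / n           # base frequency (radians per sample)

signal = np.sin(3 * f0 * np.arange(n))  # a sinusoid at 3 * f0

coeffs = np.fft.fft(signal)  # lambda + 1 coefficients, one per multiple of f0
amplitudes = np.abs(coeffs)

# the peak sits at the 3rd multiple of the base frequency (and its mirror)
print(np.argmax(amplitudes[: n // 2 + 1]))  # -> 3
```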
FT: why do we need lambda +1 frequencies to represent our original sequence?
the frequencies used are the integer multiples of the base frequency:
0 * f0
1 * f0
…
lambda * f0
>>> since the multiples start at zero, this gives lambda + 1 frequencies, one for each of the lambda + 1 data points in the window
FT: which kinds of features can we derive from FT?
frequency domain features:
- amplitude: frequency with highest amplitude describes the most important frequency in the considered window
- frequency weighted signal average: the average frequency within the considered window, weighted by the amplitudes
- power spectral entropy: describes how much information is contained within the signal
> low entropy means one or a few discrete frequencies stand out above all the others
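a sketch of computing the three frequency-domain features from the amplitudes of a window; the exact normalizations are assumptions and may differ from the book's:

```python
import numpy as np

def frequency_features(window_values):
    n = len(window_values)
    coeffs = np.fft.rfft(window_values)   # one-sided spectrum
    freqs = np.fft.rfftfreq(n)            # corresponding frequencies
    amps = np.abs(coeffs)

    # 1. frequency with the highest amplitude: most important in the window
    dominant_freq = freqs[np.argmax(amps)]

    # 2. frequency weighted signal average: amplitude-weighted mean frequency
    weighted_avg = np.sum(freqs * amps) / np.sum(amps)

    # 3. power spectral entropy: low when one or a few frequencies dominate
    psd = amps ** 2
    psd = psd / np.sum(psd)
    entropy = -np.sum(psd * np.log(psd + 1e-12))

    return dominant_freq, weighted_avg, entropy

print(frequency_features(np.sin(2 * np.pi * 0.2 * np.arange(20))))
```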
unstructured data: preprocessing steps
> 4 steps in order to extract attributes from words
- tokenization
> identify sentences and words within sentences
- lower case
> change uppercase to lowercase
- stemming
> reduce each word to its stem
> map all inflected variations of a word to a single term
- stop word removal
> remove known stop words as they are not likely to be predictive
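a sketch of the four steps using NLTK; the corpora downloads, the Porter stemmer, and the example sentence are assumptions about the toolchain, not the book's prescribed setup:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK versions
nltk.download("stopwords", quiet=True)   # stop word list

text = "The patients were sleeping badly and slept little."

# 1. tokenization: split the text into words
tokens = nltk.word_tokenize(text)

# 2. lower case
tokens = [t.lower() for t in tokens]

# 3. stemming: map variations of a word to a single stem
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]

# 4. stop word removal (plus dropping punctuation)
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops and t.isalpha()]

print(tokens)  # e.g. ['patient', 'sleep', 'badli', 'slept', 'littl']
```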
explain bag of words
bag of words:
- define n-grams of words (unigrams, bigrams, etc.)
- count the number of occurrences of each n-gram in the text, irrespective of the order of appearance
- the value of the new attribute is the occurrence count for that text
> can also be binary: true (occurs in the text) or false (does not occur)
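a sketch using scikit-learn's CountVectorizer for a unigram/bigram bag of words; the example documents are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "slept badly last night",
    "slept very well last night",
]

# count unigrams and bigrams; set binary=True for true/false attributes instead
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=False)
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # one row per document, one column per n-gram
```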
explain TF-IDF
(term frequency inverse document frequency)
- apply bag of words >>> term frequency of the n-gram in the document = tf
- normalize: divide the total number of documents by the number of documents that contain the n-gram = idf (often the logarithm of this ratio is taken)
> the higher the value, the more unique the n-gram is
- compute tf * idf = tf_idf
> n-grams that are unique across documents are weighted more heavily
> this prevents very frequent words from dominating our attributes
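the same idea sketched with scikit-learn's TfidfVectorizer; note that its exact idf formula (with smoothing and a logarithm) differs slightly from the plain ratio above, and the documents are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "slept badly last night",
    "slept very well last night",
    "felt tired all day",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

# n-grams that occur in few documents get a higher idf and thus more weight
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```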
explain topic modeling
topic modeling: extract more high level topics from text
- assume a document of W words (W drawn from a Poisson distribution) and a distribution over topics
- for each word in the document, select a topic based on these probabilities
- for each word, assume its current topic assignment is wrong while all other assignments are correct
- probabilistically reassign word w to a topic based on
> which topics are present in the document
> the number of times word w is assigned to that particular topic overall
- repeat until the assignments stabilize
>>> create one attribute per topic and assign a value based on the observed frequencies of words and the weights assigned to the words for the topic
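a sketch of topic modeling with scikit-learn's LatentDirichletAllocation; the number of topics and the toy corpus are assumptions:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "slept badly woke up tired",
    "ran fast during training session",
    "tired after a short night of sleep",
    "long run and heavy training today",
]

counts_vec = CountVectorizer()
counts = counts_vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(counts)

# one attribute per topic: the weight of that topic in each document
print(topic_weights.round(2))
```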
why are overlapping windows an issue?
> solution?
overlapping windows are highly correlated
> adjacent windows differ in just one time point
> this is likely to cause overfitting
solution: set a maximum overlap for windows and remove instances that exceed this criterion
(typically 50% overlap is allowed)
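a sketch of enforcing the overlap limit by choosing the step between retained window end points; the 50% figure follows the card, everything else is an assumption:

```python
def select_instances(num_instances, lam, max_overlap=0.5):
    """Indices of window end points kept when at most max_overlap is allowed."""
    window_size = lam + 1
    # consecutive retained windows may share at most max_overlap of their points
    step = max(1, int(round(window_size * (1 - max_overlap))))
    # the first lambda instances are skipped: they lack sufficient history
    return list(range(lam, num_instances, step))

print(select_instances(num_instances=20, lam=9, max_overlap=0.5))
# -> [9, 14, 19]: consecutive windows of 10 points overlap in 5 points (50%)
```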