Topic 3: Machine Learning: Regression, Support Vector Machine & Time Series Models Flashcards
Define information
A quantity which reduces uncertainty about something
Define prediction in the context of data science
A formula for estimating an unknown value of interest: the target
Compare and contrast predictive modeling with descriptive modeling.
Predictive modeling tries to estimate an unknown value (the target), while descriptive modeling tries to gain insight into the underlying phenomenon or process.
Define attributes or features
Attributes or features are selected variables used as input to estimate the value of the target variable. In database terminology these are the columns (the instances, or feature vectors, are the rows).
Describe model induction
the procedure that creates the model from the data is called the induction algorithm or learner.
Induction = generalizing from specific cases to general rules
Contrast induction with deduction
Deduction starts with general rules and specific facts and creates other specific facts from them.
Define the training data and labeled data.
training data is the input to the induction algorithm; it is called labeled data because the value of the target variable is known for each instance.
Describe supervised segmentation
To determine which are the most informative attributes (columns) when predicting the value of the target you can use supervised segmentation.
List the complications arising from selecting informative attributes.
- Attributes rarely split a group perfectly
- Not all attributes are binary
- Some attributes take on numeric values
When is a segmented group considered pure?
If every member of the group has the same value for the target, then the group is pure.
What do you call the outcome of a formula that evaluates how well each attribute splits a set of examples into segments?
purity measure or splitting criterion (the most common one is information gain, which is based on entropy)
Define entropy
Entropy measures the general disorder of a single set and corresponds to how mixed (impure) the segment is with respect to the properties of interest: entropy = -[p1 x log2(p1) + p2 x log2(p2) + …], where pi is the proportion of class i in the set.
high mix = high impurity = high entropy
Calculate the value of entropy
Parent set = 10, 7 non-write off, 3 write off
P(non-write off) = 7/10 = 70%
P(write off) = 3/10 = 30%
entropy = -[0.7 x log2(0.7) + 0.3 x log2(0.3)] ≈ 0.88
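A minimal Python sketch of this calculation (the entropy() helper and the write-off numbers are illustrative, not from any library):

```python
import math

def entropy(proportions):
    """Entropy of a set: -sum(p * log2(p)) over the class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Parent set: 10 instances, 7 non-write-off and 3 write-off.
print(entropy([7 / 10, 3 / 10]))  # ~0.881
```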
Define information gain
a measure of how much an attribute improves (decreases) entropy over the whole segmentation it creates.
IG -> change in entropy due to any amount of new information added in
Formula information gain
Parent entropy - (weighted average of children’s entropy)
Calculate information gain for a set of children from a parent set
IG(parent, children) = entropy(parent) - [p(c1) x entropy(c1) + p(c2) x entropy(c2) + …]
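A minimal Python sketch of this formula, assuming a simple two-class write-off example; all names and numbers are illustrative:

```python
from collections import Counter
import math

def entropy_of(labels):
    """Entropy of a list of class labels: -sum(p * log2(p))."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG(parent, children) = entropy(parent) - weighted children entropy."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_of(child) for child in children)
    return entropy_of(parent) - weighted

# Parent: 7 non-write-off ("N") and 3 write-off ("W"); a candidate attribute
# splits it into two children.
parent = ["N"] * 7 + ["W"] * 3
children = [["N"] * 6 + ["W"], ["N"] + ["W"] * 2]
print(information_gain(parent, children))  # ~0.19
```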
How does entropy relate to information gain?
entropy is a measure of disorder in a dataset; information gain is a measure of the decrease in disorder achieved by segmenting the original dataset
Discuss the issues with the numerical variables for supervised segmentation
Does it make sense to create a segment for each number? Numeric values are often discretized by choosing a split point (e.g. larger than or equal to 50%)
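One common way to choose a split point, sketched in Python and reusing the information_gain() helper from the previous card (the function name and data are illustrative):

```python
def best_split_point(values, labels):
    """Pick the numeric threshold that maximizes information gain."""
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[1:]:  # candidate thresholds between values
        left = [l for v, l in zip(values, labels) if v < t]
        right = [l for v, l in zip(values, labels) if v >= t]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# e.g. split a numeric attribute such as "balance" against write-off labels
print(best_split_point([10, 20, 35, 50, 80], ["N", "N", "W", "W", "W"]))
```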
Define variance and discuss its application to numeric variables for supervised segmentation.
Variance is the natural impurity measure for numeric values. You can compute an information-gain-like score as the reduction in variance between the parent and its children.
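A minimal sketch of variance reduction by analogy with information gain (helper names and numbers are illustrative):

```python
def variance(values):
    """Mean squared deviation from the mean (population variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Parent variance minus the weighted average of children's variances."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

# A split that separates low values from high values reduces variance a lot.
print(variance_reduction([10, 12, 11, 30, 35, 33],
                         [[10, 12, 11], [30, 35, 33]]))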
Define an entropy graph/chart
X-axis is the proportion of the dataset, Y-axis is the entropy.
The shaded area is the (weighted) entropy after the set is divided by some chosen attribute.
Goal is to decrease the shaded area.
Describe how an entropy chart can be used to select an informative variable.
Select the attribute which decreases the shaded area the most and does so for most of the values
Define a classification tree and decision nodes.
A classification tree (supervised segmentation) starts with a root node with branches to nodes (decision nodes) and ultimately to a terminal node or leaf.
Define a probability estimation tree, and tree induction.
probability estimation tree -> leaves contain probabilities
tree induction -> at each step select an attribute to partition the current group into subgroups that are as pure as possible with regard to the target variable (e.g. Oval Body / Square Body)
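A minimal tree-induction sketch using scikit-learn, assuming it is installed; criterion="entropy" makes splits by information gain, and predict_proba() exposes the leaf probabilities of a probability estimation tree (the features and labels are made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Two binary features per instance, e.g. [oval_body, square_head]; the
# target is whether the instance is a write-off.
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2)  # IG-based splits
tree.fit(X, y)
print(tree.predict([[1, 1]]))        # class predicted at the leaf
print(tree.predict_proba([[1, 1]]))  # leaf class probabilities
```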
Define a decision surface or decision boundaries.
Lines separating the regions of an instance space (e.g. a scatterplot); all instances within a region receive the same classification.
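A sketch of how decision regions can be read off a trained classifier, reusing the fitted tree from the previous card and assuming NumPy is available: predict a class for every point on a grid over the instance space; the boundaries are where the prediction changes.

```python
import numpy as np

xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.c_[xx.ravel(), yy.ravel()]            # points covering the instance space
regions = tree.predict(grid).reshape(xx.shape)  # predicted class at each point
# Cells where `regions` changes value trace the decision boundaries.
```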