Machine Learning Flashcards

1
Q

Task

A

What we want to obtain given a set of data.

2
Q

Features

A

The properties of the data that are used as input by the model

3
Q

Model

A

Gets the data as input and returns an output.

4
Q

Learning algorithm

A

The algorithm that generates the model, given a specific set of data.

5
Q

Classification

A

Task: assign a label Ci, from a set of labels C, to a given input.

Learning task: find a function c*, called a classifier, that best approximates the true classification function c.

6
Q

Evaluation

A

We need to evaluate a model to know how well it works for the task. There are many measures that can be used to evaluate a model:

  • accuracy
  • recall

7
Q

Decision tree

A

A model for CLASSIFICATION TASKS that uses a tree in which the internal nodes are feature-wise questions to answer, the answers label the arcs to follow, and the leaves are the class labels.
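
A minimal sketch of how such a tree could be represented and traversed; the feature names and tree shape below are made up for illustration:

```python
# Hypothetical decision tree: internal nodes hold a feature-wise question,
# arcs are labeled by the possible answers, leaves hold the class labels.
tree = {
    "question": "outlook",                  # feature to test (made-up)
    "branches": {
        "sunny": {"label": "play"},         # leaf: class label
        "rainy": {
            "question": "windy",
            "branches": {
                True:  {"label": "stay home"},
                False: {"label": "play"},
            },
        },
    },
}

def classify(node, example):
    """Follow the arcs matching the example's feature values down to a leaf."""
    while "label" not in node:
        node = node["branches"][example[node["question"]]]
    return node["label"]

print(classify(tree, {"outlook": "rainy", "windy": False}))  # -> play
```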

8
Q

Accuracy

A

It’s an evaluation measure that calculates:

Number of correctly labeled examples
/
Number of labeled examples

(TRUE POSITIVES + TRUE NEGATIVES) / (ALL POSITIVES + ALL NEGATIVES)

9
Q

Recall

A

In a BINARY CLASSIFICATION we define:

  • POSITIVE RECALL = TRUE POSITIVES / ALL POSITIVES
  • NEGATIVE RECALL (SPECIFICITY) = TRUE NEGATIVES / ALL NEGATIVES
  • AVERAGE RECALL = (POSITIVE RECALL + NEGATIVE RECALL) / 2

10
Q

Contingency table

A

It’s a tool used in BINARY CLASSIFICATION: the columns hold the counts for the predicted classes, and the rows the counts for the actual classes.

             |      Predicted Class      |
             |  Positive   |  Negative   |
Actual Class |-------------|-------------|
  Positive   |     TP      |     FN      |
  Negative   |     FP      |     TN      |
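
A minimal sketch computing accuracy and the recalls from the four cells of such a table (the counts are made up):

```python
# Cells of a hypothetical contingency table
TP, FN = 60, 10   # actual positives:  Pos = TP + FN = 70
FP, TN = 5, 25    # actual negatives:  Neg = FP + TN = 30

pos, neg = TP + FN, FP + TN

accuracy = (TP + TN) / (pos + neg)            # 0.85
positive_recall = TP / pos                    # 0.857...
negative_recall = TN / neg                    # specificity, 0.833...
average_recall = (positive_recall + negative_recall) / 2

print(accuracy, positive_recall, negative_recall, average_recall)
```
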
11
Q

Coverage plot

A

It’s a graph of width N and height M.
N: the number of NEGATIVES; the x-axis counts FALSE POSITIVES
M: the number of POSITIVES; the y-axis counts TRUE POSITIVES

It can be used to COMPARE MODEL PERFORMANCE by finding the representing coordinates for each model in the graph, using the model’s numbers of TRUE and FALSE POSITIVES.

Classifiers with the same accuracy are connected by lines of slope 1 (DEMONSTRATION)

Classifiers with the same average recall are connected by lines parallel to the main diagonal (slope Pos/Neg) (DEMONSTRATION)

NEEDS NORMALIZATION to compare models evaluated on different datasets!

12
Q

ROC plot

A

It’s the coverage plot, normalized, so the graph has width 1 and height 1.
It lets us compare models whose performances are calculated on different datasets and class counts.

Classifiers with the same accuracy are connected by lines of slope Neg/Pos (DEMONSTRATION)

Classifiers with the same average recall are connected by lines of slope 1 (DEMONSTRATION)
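
A minimal sketch of the normalization, mapping a coverage-plot point (FP, TP) into the ROC unit square (the counts are made up):

```python
def to_roc(tp, fp, pos, neg):
    """Normalize coverage-plot coordinates into (false positive rate, true positive rate)."""
    return fp / neg, tp / pos

# Two hypothetical models evaluated on datasets of different sizes
print(to_roc(tp=60, fp=5, pos=70, neg=30))      # (0.1666..., 0.857...)
print(to_roc(tp=600, fp=50, pos=700, neg=300))  # same ROC point: now comparable
```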

13
Q

Scoring classification

A

This task is similar to classification, but the model is a SCORING CLASSIFIER s* that takes an input and returns an N-dimensional vector of scores, where N is the number of classes.

So it does not tell the class associated with the input, but the SCORE of each class for the input.

The TRAINING SET is the same as the one for CLASSIFICATION.

14
Q

Margin

A

For a SCORING BINARY CLASSIFIER, the margin is a function that takes an input and returns a positive value if the input is classified correctly, negative otherwise.
It can be written as:

margin(x) = c(x) * s(x) =
1. +|s(x)| if the classification is correct
2. -|s(x)| if the classification is not correct

where
margin(x): the margin function
c(x): the true class of x (+1 for the positive, -1 for the negative class)
s(x): the score given to x by the classifier

15
Q

Loss function

A

For the purpose of rewarding large positive margins and penalizing large negative margins, we define a LOSS FUNCTION.

A loss function is a function L:

L: R -> [0, +inf)

that maps each example’s margin z(x) to an associated LOSS L(x).

We bound, or assume, L(x) to be:

L(x) > 1 for z(x) < 0
L(x) = 1 for z(x) = 0
0 < L(x) < 1 for z(x) > 0

So the value of L is less than 1 for each correctly classified example, and at least 1 otherwise.

16
Q

0-1 Loss

A

A loss function:

L(x) = 1 for z(x) <= 0
L(x) = 0 for z(x) > 0

It has some problems:
- it is discontinuous, so it is not differentiable everywhere
- its derivatives are uninformative (it is composed of 2 flat segments of slope 0)
- once the examples in the training set are correctly classified, we cannot use the function to improve the model any further

17
Q

Hinge Loss

A

A loss function:

L(x) = 1 - z(x) for z(x) < 1
L(x) = 0 for z(x) >= 1

It solves the discontinuity problem of the 0-1 loss, but its derivative is still discontinuous at z(x) = 1.

18
Q

Logistic Loss

A

A continuous loss function:

L(x) = log2( 1 + exp( -z(x) ) )

It has a slope similar to the hinge loss, but is smooth everywhere.

19
Q

Exponential Loss

A

A continuous loss function

L(x) = exp( -z(x) )

20
Q

Quadratic Loss

A

A continuous loss function:

L(x) = (1 - z(x))^2

Among these simple loss functions it is the only one that goes to +inf as z(x) goes to +inf; this can be a problem in optimization, since large positive margins get penalized.
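
A minimal sketch of the five loss functions above, written as functions of the margin z(x) (here simply z):

```python
import math

def zero_one_loss(z):
    return 1.0 if z <= 0 else 0.0

def hinge_loss(z):
    return max(0.0, 1.0 - z)

def logistic_loss(z):
    return math.log2(1.0 + math.exp(-z))

def exponential_loss(z):
    return math.exp(-z)

def quadratic_loss(z):
    return (1.0 - z) ** 2

# All of them are 1 at z = 0 and grow for negative margins; only the
# quadratic loss also grows again for large positive margins.
for z in (-2, 0, 1, 3):
    print(z, hinge_loss(z), logistic_loss(z), exponential_loss(z), quadratic_loss(z))
```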

21
Q

Ranking

A

When using a scoring classifier, we can use the scores associated with the examples to sort them. This defines a RANKING function on the scores.

22
Q

Ranking error

A

When using a RANKING, we will sometimes put the (hopefully few) FALSE POSITIVES and FALSE NEGATIVES that were scored highly ahead of TRUE POSITIVES and TRUE NEGATIVES that were scored lower. This is a RANKING ERROR.

The RANKING ERROR RATE can be calculated by taking each possible couple of 1 POS and 1 NEG example, and counting how many of these couples are ordered wrongly in the ranking, i.e. the NEG is ranked before the POS:

RANKING ERROR RATE =
[ Numberof(s(x) < s(x*)) + 0.5 * Numberof(s(x) = s(x*)) ]
/
( Numberof(POS) * Numberof(NEG) )

where s(x) is the score of a POSITIVE and s(x*) is the score of a NEGATIVE.

Example:
++-++---

We have 1 - in front of 2 +, so there are 1*2 = 2 wrongly ordered couples. We assume each example has a different score, so there are no 0.5 terms. With a total of 4 POS and 4 NEG:

RANKING ERROR RATE = 2 / (4*4) = 2/16 = 1/8
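
A minimal sketch counting the wrongly ordered positive-negative couples; the made-up scores reproduce the ranking + + - + + - - - and the 1/8 of the example above:

```python
def ranking_error_rate(pos_scores, neg_scores):
    """Fraction of (POS, NEG) couples ranked wrongly; ties count as 0.5."""
    errors = sum(
        1.0 if p < n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return errors / (len(pos_scores) * len(neg_scores))

pos = [8, 7, 5, 4]  # hypothetical scores of the 4 positive examples
neg = [6, 3, 2, 1]  # hypothetical scores of the 4 negative examples
print(ranking_error_rate(pos, neg))  # 2 / 16 = 0.125 = 1/8
```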

23
Q

Binary classifiers and Ranking functions

A

If N examples are ordered by their SCORE, we can define N+1 BINARY CLASSIFIERS, one for each point where the ordered list can be split into 2 sets. We can determine how good those classifiers are by calculating the RANKING ERROR RATE on the ordered list.

24
Q

Class probability estimation

A

This is a scoring classifier with the peculiarity that the score is the PROBABILITY ESTIMATE that the example belongs to each class.

p*: X -> [0,1]^k

where
p* is the probability estimation function
X is the example space
k is the number of classes

So p*(x) is a vector of probability estimates that the given example x belongs to each of the k classes.

For a binary classification we omit the probability of the NEGATIVE class, since it can be derived from the POS probability.

25
Q

Mean squared probability error

A

This is one way to estimate the error in the probabilities assigned to an example x, knowing the true class c(x). It is calculated as follows:

MSE(x) = 0.5 * || p*(x) - Ic(x) ||^2 =
       = 0.5 * SUMi ( p*i(x) - Ici(x) )^2

where:
- p*(x) is the probability vector over the classes for the example x
- Ic(x) is the vector that has 1 in the position of the true class, and 0 otherwise
- the ||A||^2 operation is the squared Euclidean length of vector A, calculated as SUMi( Ai * Ai )
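
A minimal sketch of the computation for a single example (the probability vector is made up):

```python
def squared_probability_error(p, true_class):
    """0.5 times the squared Euclidean distance between p*(x) and I_c(x)."""
    indicator = [1.0 if i == true_class else 0.0 for i in range(len(p))]
    return 0.5 * sum((pi - ti) ** 2 for pi, ti in zip(p, indicator))

# Hypothetical probability vector over 3 classes; the true class is index 0
print(squared_probability_error([0.7, 0.2, 0.1], true_class=0))
# 0.5 * ((0.7-1)^2 + 0.2^2 + 0.1^2) = 0.07
```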

26
Q

Empirical probabilities

A

We can use empirical probabilities to define a reference for a class probability estimation. We define the EMPIRICAL PROBABILITY VECTOR as follows:

p^(S) = ( n1/|S|, n2/|S|, ..., nk/|S| )

where
S is the set of labeled examples
ni is the number of examples labeled with class Ci

27
Q

Laplace correction

A

It’s the most common way to smooth the relative frequencies of classes in the calculation of EMPIRICAL PROBABILITIES.
It can be applied uniformly, by adding 1 to each of the k class counts, so the empirical probability becomes:

p^i(S) = (ni + 1) / (|S| + k)

It can also be done with specific weights for each class:

p^i(S) = (ni + wi*m) / (|S| + m)

where
wi is the a priori probability of class i
m is a chosen number of pseudo-counts (the uniform case is wi = 1/k and m = k)
k is the number of classes
sum(wi) = 1
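
A minimal sketch of the empirical probabilities with both corrections (the class counts are made up):

```python
def empirical_probs(counts):
    """Raw relative frequencies n_i / |S|."""
    total = sum(counts)
    return [n / total for n in counts]

def laplace_uniform(counts):
    """Add 1 pseudo-count per class: (n_i + 1) / (|S| + k)."""
    total, k = sum(counts), len(counts)
    return [(n + 1) / (total + k) for n in counts]

def laplace_weighted(counts, weights, m):
    """(n_i + w_i * m) / (|S| + m), with sum(weights) == 1."""
    total = sum(counts)
    return [(n + w * m) / (total + m) for n, w in zip(counts, weights)]

counts = [8, 2, 0]                     # note the zero count for the third class
print(empirical_probs(counts))         # [0.8, 0.2, 0.0]
print(laplace_uniform(counts))         # [9/13, 3/13, 1/13] -- no zeros left
print(laplace_weighted(counts, [0.5, 0.3, 0.2], m=5))
```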

28
Q

Binary classifier Multiclass handling

A

A binary classifier can be used to generate a multiclass labeling. We have various approaches:

  • one vs rest
    • unordered learning
    • fixed-order learning
  • one vs one
    • symmetric
    • asymmetric
29
Q

1 vs rest - unordered

A

The classifier is made of n classifiers, where n is the number of classes. Each one is a binary classifier that returns +1 if it predicts its own class, -1 if it thinks the example belongs to any other class.

We can define an OUTPUT CODE MATRIX for this classifier (with n=3):

+1 -1 -1
-1 +1 -1
-1 -1 +1

where the columns are the results of the classifiers for C1, C2, C3
and the rows are the classes C1, C2, C3

30
Q

1 vs rest - fixed order

A

The classifier is made of n-1 classifiers, where n is the number of classes. Each classifier is a binary classifier, but we use the classifiers in a fixed order, ASSUMING that an example that reaches the k-th classifier cannot belong to any of the k-1 classes checked by the previous classifiers.

So for example, with 3 classes the checks are:

  1. C1 vs C2,C3 -> it is C1, or go to step 2.
  2. C2 vs C3 -> it is C2 or C3

For this example the OUTPUT CODE MATRIX will be:

+1  0
-1 +1
-1 -1

NB:
We see that there is a THIRD VALUE, the 0. It is not really a value returned by the binary classifier: it marks a class that the corresponding classifier never predicts.

31
Q

1 vs 1 - symmetric

A

The classifier is made of one binary classifier for each possible couple of classes, i.e. n(n-1)/2 classifiers, where n is the number of classes.
Each classifier determines which of its two classes the example belongs to.

32
Q

1 vs 1 - asymmetric

A

The classifier is made of two binary classifiers for each possible couple of classes (one for each ordering), i.e. n(n-1) classifiers.
Each classifier determines which of its two classes the example belongs to.

33
Q

Binary classifier Multiclass Decoding

A

When the classifier is trained and we have the OUTPUT CODE MATRIX available, we can use it to define the classification for a new example given to the model.

Given n classes, we receive a WORD w, a vector whose dimension equals the number of classifiers. We choose the class for the given example as the one that MINIMIZES the distance between w and the corresponding row of the OUTPUT CODE MATRIX:

D[Cx] = SUMi ( 1 - ci*wi ) / 2

where ci are the values of the selected row, and wi are the values of the vector w.

For example:
+1 -1 -1
-1 +1 -1
-1 -1 +1

w = (-1, -1, +1)
D[C1] = (1-(-1))/2 + (1-(1))/2 + (1-(-1))/2 = 1 + 0 + 1 = 2
D[C2] = (1-(1))/2 + (1-(-1))/2 + (1-(-1))/2 = 0 + 1 + 1 = 2
D[C3] = (1-(1))/2 + (1-(1))/2 + (1-(1))/2 = 0

So the row with the smallest distance is row 3, so the example is labeled C3.
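
A minimal sketch of the decoding step, using the one-vs-rest matrix of the example above:

```python
def decode(word, code_matrix):
    """Return the index of the row that minimizes the distance to the word."""
    def distance(row):
        # SUMi (1 - ci*wi) / 2 counts the positions where row and word disagree
        return sum((1 - c * w) / 2 for c, w in zip(row, word))
    distances = [distance(row) for row in code_matrix]
    return distances.index(min(distances))

M = [[+1, -1, -1],   # rows are the classes C1, C2, C3
     [-1, +1, -1],
     [-1, -1, +1]]

print(decode([-1, -1, +1], M))  # 2 -> third row, so the example is labeled C3
```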

34
Q

Binary classifier Multiclass Decoding With Same Distance

A

If, in the decoding step, while classifying new data I get multiple rows with the same minimum distance, I can:

  1. decide randomly between the rows with the same minimum distance
  2. reduce the chance of ties by adding new classifiers, merging several types of multiclass binary classifiers (e.g. adding asymmetric columns)
  3. instead of classifiers, which by definition return +1 for the right class and -1 (or 0) otherwise, use a scoring classifier or a probabilistic classifier; this makes equal distances much less likely.

35
Q

Regression

A

This is a task in which the input is an item from the example set, and the output is a number in R.

We do not want to find a class; we want to find a point on a function.

So, given the training set (xi, f(xi)), we want to define a function f* such that f*(xi) is as similar as possible to f(xi).

ATTENTION
The easiest way to solve the task during training is to define f* as a polynomial of degree n-1, where n is the number of examples in the training set. This way f* would match the training data perfectly, but this would be OVERFITTING, since f* would be excessively tailored to the example data.
So the rule of thumb is to have fewer parameters than the number of examples (this does not apply to neural networks).
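
A minimal sketch of the overfitting issue using polynomial fits (the data, noise level, and degrees are made up); it assumes numpy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                                # n = 10 training examples
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy samples of f

for degree in (1, 3, 9):   # degree 9 = n - 1: interpolates the points exactly
    coeffs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_error)
# The training error shrinks to ~0 at degree n - 1, but such an f*
# follows the noise: it is overfitted and generalizes poorly.
```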

36
Q

Bias-Variance Dilemma

A

If we underestimate the number of parameters, we won’t be able to decrease the LOSS FUNCTION no matter how many examples we use in training (high bias).

If we overestimate the number of parameters, the model will be more dependent on the training set, and less flexible on new cases (high variance).
