Chapter 2 Quiz Flashcards
(32 cards)
methods are trained on a set of training data, and their performance is then evaluated on a separate set of validation data
data partitioning
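A minimal sketch of data partitioning, assuming a simple random split. The function name `partition`, the 60% training fraction, and the seed are illustrative choices, not taken from the cards.

```python
import random

def partition(records, train_fraction=0.6, seed=1):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible (assumed seed value).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))          # stand-in for ten data rows
train, valid = partition(records)  # 6 training rows, 4 validation rows
```

The model would be fit on `train` only, and `valid` would be held back to evaluate it.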
tasks of classification and prediction as well as pattern discovery
predictive analytics
predicting the value of a categorical variable
classification
predicting the value of a numerical variable
prediction
finding general association patterns between items in large databases, expressed as rules that apply to an entire population
association rules/affinity analysis
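A small sketch of how a rule such as "bread implies milk" might be scored. Support and confidence are standard measures for association rules, but they are not named on the cards, and the transactions below are made-up illustration data.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in 2 of 4 baskets, and 2 of the 3 baskets with bread also contain milk.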
method that makes recommendations from individual users' preferences and tastes, as revealed by their historic purchases or other measurable behavior indicative of preference
collaborative filtering
consolidating a large number of records into a smaller set
data reduction
methods for reducing the number of cases
clustering
reduction of the number of variables
dimension reduction
exploration by creating charts and dashboards
data visualization/visual analytics
used in classification and prediction; require data in which the value of the outcome of interest is known
supervised learning algorithms
data from which a classification or prediction algorithm learns
training data
sample of data where the outcome is known, used to compare the performance of candidate models
validation data
sample of data where the outcome is known, used to estimate how well the chosen model will perform on new data
test data
there is no outcome variable to predict or classify
unsupervised learning algorithms
steps in machine learning
1. understand project
2. obtain data
3. preprocess data
4. reduce dimensions (if necessary)
5. determine ML task
6. partition data
7. choose ML technique
8. perform task
9. interpret results
10. deploy model
SEMMA
sample, explore, modify, model, assess
character, integer, categorical
types of variables
unordered categorical variables
nominal variables
ordered categorical variables
ordinal variables
categorical variables decomposed into a series of binary variables
dummy variables
creating a separate binary dummy variable for each category of a categorical variable
one-hot encoding
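A minimal sketch of one-hot encoding in plain Python. The helper `one_hot`, the `is_` column prefix, and the color values are hypothetical, chosen only for illustration.

```python
def one_hot(values):
    """Turn a list of category labels into one dict of 0/1 dummy variables per record."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors)
# Each record gets three binary columns: is_blue, is_green, is_red.
# Exactly one of them is 1 for any given record.
```

Some modeling methods need one dummy dropped to avoid redundancy, since the last column is implied by the others.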
values that lie far away from the bulk of the data
outliers
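One possible way to flag outliers, sketched with a z-score rule of thumb. The cutoff of 2 standard deviations and the sample values are assumptions for illustration; the cards do not specify a detection rule, and whether a flagged value is an error is a domain judgment.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

ages = [10, 12, 11, 13, 12, 11, 10, 95]  # made-up data with one extreme value
flagged = flag_outliers(ages)
```

In small samples a single extreme value inflates the standard deviation, so stricter cutoffs such as 3 may fail to flag it.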
knowledge of the particular application being considered
domain knowledge