Week 1 Flashcards
(28 cards)
Structured Data:
Data that can be stored in a structured way (e.g. in the rows and columns of a table).
Unstructured Data:
Data not easily stored or described (e.g. text from social media).
Quantitative Data:
Numbers with a meaning (e.g. 3 baseballs).
Categorical Data:
Values that label a category, where any numbers carry no numeric meaning (e.g. an area code or country of origin).
Binary Data:
Data that takes one of two values (e.g. yes or no).
Unrelated Data:
No relationship between data points (e.g. players on different teams).
Time Series Data:
The same data recorded over time (e.g. an athlete’s performance over time).
Scaling Data:
Transforming your data so that features are within a specific range (e.g. 0-1).
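The scaling idea on this card can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `min_max_scale` and the sample values are made up for the example, and it assumes the common min-max formula (subtract the minimum, divide by the range).

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


# Hypothetical feature with a large raw range.
scaled = min_max_scale([1000, 1250, 1500, 2000])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```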
Standardizing Data:
Rescaling your observations to a mean of 0 and a standard deviation of 1, so they can be described relative to a standard normal distribution.
Validation:
Verifying that models are performing as intended
Hard Classifiers:
Separate the data into groups perfectly, with no misclassified points.
Soft Classifiers:
Give as good a separation as possible when a perfect separation is not possible.
SVM
Support vector machines are supervised machine learning models used for classification.
Support Vector
The name comes from the idea that a line touching the edge of a shape (or ‘supporting’ it) is called a support vector.
True or False: The support vector machine automatically (hence ‘machine’) determines the support vectors, or the points supporting the shape on parallel lines.
True
Goal of SVM
The goal is to maximize (or optimize) the margin, the space between the parallel lines through the support vectors, while minimizing classification errors between the classes.
Lambda in SVM
Lambda controls the trade-off between margin and error: as it grows, the margin outweighs any error, and as it approaches zero, minimizing mistakes becomes much more important. We can also add a multiplier m_j per error to weight the errors, with a larger multiplier marking a more important error than a smaller one.
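A small numeric sketch of that trade-off, assuming the common soft-margin objective (total hinge error plus lambda times the sum of squared coefficients). The function name, the toy points, and the two candidate classifiers `tight` and `wide` are all hypothetical and chosen only for illustration.

```python
def svm_objective(a, a0, points, labels, lam):
    """Total hinge error plus lam times the margin term sum(a_i^2)."""
    error = 0.0
    for x, y in zip(points, labels):
        score = sum(ai * xi for ai, xi in zip(a, x)) + a0
        # A point contributes error when it is misclassified or inside the margin.
        error += max(0.0, 1.0 - y * score)
    # Smaller coefficients mean a wider margin, so this term rewards wide margins.
    margin_term = sum(ai * ai for ai in a)
    return error + lam * margin_term


points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

tight = ((0.5, 0.5), -3.0)    # narrower margin, no errors on this data
wide = ((0.25, 0.25), -1.5)   # wider margin, two points fall inside it

for lam in (0.1, 10.0):
    tight_cost = svm_objective(*tight, points, labels, lam)
    wide_cost = svm_objective(*wide, points, labels, lam)
    better = "tight" if tight_cost < wide_cost else "wide"
    print(f"lambda={lam}: tight={tight_cost:.4f}, wide={wide_cost:.4f}, prefers {better}")
```

With the small lambda the error-free classifier scores better. With the large lambda the wide-margin classifier wins despite its errors.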
minimize error and margin equation
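The equation itself is missing from this card. One common way to write the soft-margin objective, offered here as an assumption rather than the card's original answer, is shown below. The symbols are not from the card: a_1, ..., a_n are the classifier's coefficients, a_0 its intercept, x_ij the value of feature i for data point j, and y_j the label (+1 or -1) of point j.

```latex
\min_{a_0, \dots, a_n} \;
\sum_{j=1}^{m} \max\Bigl\{ 0,\; 1 - y_j \Bigl( \sum_{i=1}^{n} a_i x_{ij} + a_0 \Bigr) \Bigr\}
\; + \; \lambda \sum_{i=1}^{n} a_i^{2}
```

The first sum is the total error and the second is the margin term, weighted by lambda. A per-error multiplier m_j, if used, would multiply each term of the first sum.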
What happens to our SVM if we have data that varies widely in range?
Our SVM model may be thrown off if we have data that varies widely in range. Remember that SVM’s goal is to maximize the distance between the separating plane and the support vectors. If one feature is much bigger than another (e.g. X1 is 0.3-0.6 and X2 is 1000-2000), the feature with the large range will dominate the model and throw off our results.
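To see why the large range dominates, here is a rough sketch using the X1 and X2 ranges from this card. It uses plain Euclidean distance rather than an actual SVM, on the assumption that the same effect carries over to any distance-based margin. The helper names and the two sample points are hypothetical.

```python
def squared_diff_shares(p, q):
    """Return each feature's share of the squared Euclidean distance between p and q."""
    squared = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(squared)
    return [s / total for s in squared]


def rescale(value, low, high):
    """Min-max rescale a single value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


# Two hypothetical points at opposite ends of both feature ranges.
point_a = (0.3, 1000.0)
point_b = (0.6, 2000.0)
raw_shares = squared_diff_shares(point_a, point_b)

ranges = [(0.3, 0.6), (1000.0, 2000.0)]
scaled_a = tuple(rescale(v, lo, hi) for v, (lo, hi) in zip(point_a, ranges))
scaled_b = tuple(rescale(v, lo, hi) for v, (lo, hi) in zip(point_b, ranges))
scaled_shares = squared_diff_shares(scaled_a, scaled_b)

print("raw shares (X1, X2):", raw_shares)        # X2 accounts for nearly all of the distance
print("scaled shares (X1, X2):", scaled_shares)  # both features contribute equally
```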
What is the most common scaling?
Between 0 and 1
How do you scale to a normal distribution?
You rescale the data to a mean of 0 and a standard deviation of 1.
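A minimal sketch of that rescaling using only Python's standard library. The function name `standardize` and the sample data are made up for the example, and dividing by the population standard deviation (rather than the sample one) is an assumption. Either convention is used in practice.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so every z-score is 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical sample with mean 5 and population standard deviation 2.
z_scores = standardize([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

After the transformation the values have a mean of 0 and a standard deviation of 1. The shape of the distribution itself is unchanged.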
You use scaling (or normalizing) when you’re working with data of what kind?
Data in a bounded range
Batting averages
SAT scores
What kinds of models do you use standardization with?
PCA
Clustering
How does KNN classify data?
Rather than using a line to separate data into classes, the KNN algorithm classifies data by looking at a data point’s “nearest neighbors.”
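The nearest-neighbor idea can be sketched as follows. This is an illustrative toy, not a production implementation: the function name `knn_classify`, the choice of Euclidean distance, majority voting, k=3, and the sample points are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label, closest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_classify(train_points, train_labels, (2, 2)))  # red
print(knn_classify(train_points, train_labels, (6, 5)))  # blue
```

Because the vote depends on distances, features with very different ranges are usually scaled before applying KNN.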