Quiz #2 Flashcards
What steps and primary questions comprise the data wrangling process?
- What is the population of interest?
- What sample S are we evaluating?
- Is sample S representative of the population?
- How do we cross-validate to evaluate our model? How do we avoid overfitting and data mining?
- What prediction task (classification vs. regression) do we care about? What is the meaningful evaluation criteria?
- How do we create a reproducible pipeline?
What are some examples of the definition of a population (in data terms)?
All users on Facebook.
All US users on Facebook.
All US users on Facebook in the last month.
All the watermelons in the back of the truck.
All the watermelons greater than 5lbs in the back of the truck.
Etc…
How do we obtain data from a population?
Sampling
What are two simple probability-based methods for sampling?
- Simple Random Sampling
- Stratified Random Sampling
What is simple random sampling of a population?
Every observation from the population has the same chance of being sampled
What is stratified random sampling of a population?
Population is partitioned into groups and then a simple random sampling approach is applied within each group.
Example: In the watermelons in the back of the truck example, we could partition into 3 groups: (1) less than 5lbs, (2) greater than 5lbs but less than 10lbs, and (3) greater than 10lbs. We could then randomly sample within each group. This is stratified random sampling.
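The watermelon example above can be sketched in plain Python (the weights below are made up for illustration):

```python
import random

random.seed(0)

# Hypothetical weights (lbs) of the watermelons in the back of the truck.
weights = [3.2, 4.8, 6.1, 7.5, 9.9, 11.2, 12.4, 4.1, 8.8, 15.0]

# Partition the population into the three strata described above.
strata = {
    "under_5": [w for w in weights if w < 5],
    "5_to_10": [w for w in weights if 5 <= w < 10],
    "over_10": [w for w in weights if w >= 10],
}

# Simple random sample (without replacement) within each stratum.
sample = {name: random.sample(group, k=min(2, len(group)))
          for name, group in strata.items()}
```

Sampling within every stratum guarantees each weight group is represented, which plain simple random sampling does not.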
What are some best practices for data wrangling?
- Clearly define your population and sample
- Understand the representativeness of your sample
- Cross-validation can go wrong in many ways; understand the relevant problem and prediction task that will be done in practice
- Know the prediction task of interest (regression vs. classification)
- Incorporate model checks and evaluate multiple predictive performance metrics
What is Cross Validation (CV)?
A method for estimating prediction error.
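A minimal k-fold CV sketch in pure Python, using a trivial predict-the-mean model just to show the fit-on-train / score-on-held-out loop:

```python
def kfold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold CV over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cv_error(ys, k=5):
    """Estimate squared prediction error of a predict-the-mean model."""
    errs = []
    for train, test in kfold_splits(len(ys), k):
        pred = sum(ys[i] for i in train) / len(train)  # "fit" on train folds
        errs += [(ys[i] - pred) ** 2 for i in test]    # score on held-out fold
    return sum(errs) / len(errs)
```

Each observation is held out exactly once, so the averaged error estimates how the model would do on unseen data.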
Grid search is always better than random search when trying to optimize hyperparameters? (True/False)
False. One 2012 paper by Bergstra and Bengio found that random search is often just as good, if not better, than grid search.
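A toy illustration of the Bergstra and Bengio intuition (the objective below is made up; it assumes only one hyperparameter really matters):

```python
import random

random.seed(1)

def loss(lr, reg):
    # Toy objective where only lr matters much. When few hyperparameters
    # matter, random search covers the important axis more densely than a
    # grid with the same trial budget.
    return (lr - 0.37) ** 2 + 0.001 * reg

# Grid search: 3x3 = 9 trials, but only 3 distinct lr values tried.
grid = [(lr, reg) for lr in (0.1, 0.5, 0.9) for reg in (0.0, 0.5, 1.0)]
best_grid = min(loss(lr, reg) for lr, reg in grid)

# Random search: same 9-trial budget, 9 distinct lr values tried.
rand = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(9)]
best_rand = min(loss(lr, reg) for lr, reg in rand)
```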
What are two methods for handling class imbalances?
- Sampling-based, e.g. SMOTE (Synthetic Minority Over-sampling Technique)
- Cost-based, e.g. Focal Loss for object detection
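A minimal sketch of the sampling-based idea using plain random oversampling (the dataset is synthetic; note SMOTE goes further and interpolates new synthetic minority points between neighbors rather than duplicating existing ones):

```python
import random

random.seed(0)

# Toy imbalanced dataset: label 1 is the minority class (5 vs. 95).
data = [(x, 0) for x in range(95)] + [(x, 1) for x in range(5)]

minority = [d for d in data if d[1] == 1]
majority = [d for d in data if d[1] == 0]

# Resample the minority class with replacement until classes balance.
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra
```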
What is one type of plot we can use to gauge the confidence a model has in its prediction?
Calibration plot
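The quantity a calibration plot graphs can be computed by hand (the probabilities and labels below are hypothetical):

```python
# Hypothetical predicted probabilities and true binary labels.
probs  = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1,   1,   1]

# Bin predictions, then compare mean predicted probability to the observed
# fraction of positives in each bin. A calibration plot graphs one against
# the other; a well-calibrated model hugs the diagonal.
bins = {}
for p, y in zip(probs, labels):
    b = int(p * 2)  # two bins: [0, 0.5) and [0.5, 1.0]
    bins.setdefault(b, []).append((p, y))

points = []
for b, items in sorted(bins.items()):
    mean_pred = sum(p for p, _ in items) / len(items)
    frac_pos = sum(y for _, y in items) / len(items)
    points.append((mean_pred, frac_pos))
```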
It isn’t necessary to include a datasheet when creating a new dataset? (True/False)
False. It can be very helpful to future researchers (including yourself!) to understand how the dataset was constructed.
What things should be included in a datasheet for a dataset?
- Motivation (why is the dataset needed?)
- Composition
- Collection process
- Recommended uses
…etc.
What are the three steps in the Data Cleaning process for ML?
- Clean
- Transform
- Preprocess
What are three mechanisms that can cause missing data?
- Missing completely at random.
- Missing at random: likelihood of any observation to be missing depends on OBSERVED data features (ex: men are less likely to fill out surveys about depression)
- Missing not at random: likelihood of any observation to be missing depends on UNOBSERVED outcome (ex: a person might be less likely to complete a survey if they are depressed)
What are some ways we can fix missing data?
- Remove (easy, but wasteful)
- Imputation (mean/median, using a learned model to predict, etc.)
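Mean imputation is a one-liner; a sketch on a made-up feature column where `None` marks missing values:

```python
# Feature column with missing values marked as None.
feature = [2.0, None, 4.0, 6.0, None]

# Compute the mean over observed values only, then fill the gaps with it.
observed = [v for v in feature if v is not None]
mean = sum(observed) / len(observed)  # (2 + 4 + 6) / 3 = 4.0

imputed = [mean if v is None else v for v in feature]
# imputed == [2.0, 4.0, 4.0, 6.0, 4.0]
```

Note this only gives unbiased estimates under missing-completely-at-random; under the other two mechanisms above, model-based imputation is usually safer.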
What are some examples of the data transformation step in the data cleaning process?
- Converting categorical to index (ordinal numbering, one-hot encoding, etc)
- Bag-of-words
- TF-IDF
- Embeddings
…etc
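The first transformation on the list, categorical-to-index, can be sketched in a few lines (the color feature is made up):

```python
# One-hot encoding of a categorical feature, plus the ordinal alternative.
colors = ["red", "green", "blue", "green"]

categories = sorted(set(colors))      # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(categories)}

ordinal = [index[c] for c in colors]  # ordinal numbering: [2, 1, 0, 1]
one_hot = [[1 if i == index[c] else 0 for i in range(len(categories))]
           for c in colors]           # one row per observation
```

Ordinal numbering imposes an artificial order on the categories; one-hot encoding avoids that at the cost of one column per category.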
What are some examples of the data preprocessing step in the data cleaning process?
Zero-center data, normalization, etc
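Both steps in a small sketch (the feature values are arbitrary):

```python
# Zero-center a feature, then scale it to unit variance (standardization).
xs = [2.0, 4.0, 6.0, 8.0]

mean = sum(xs) / len(xs)                          # 5.0
var = sum((x - mean) ** 2 for x in xs) / len(xs)  # 5.0
std = var ** 0.5

centered = [x - mean for x in xs]                 # zero-centered
standardized = [(x - mean) / std for x in xs]     # zero mean, unit variance
```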
What are three important components of fairness in ML?
- Anti-classification: verifying that protected attributes like race, gender, etc. (and their proxies!) are not used to make decisions.
- Classification Parity: common measures of predictive performances are equal across groups defined by protected attributes.
- Calibration: conditional on risk estimates, outcomes are independent of protected attributes.
What is an example of how a proxy to a protected attribute might result in an unfair ML model?
One example might be using features like zip code in areas with high racial segregation. If the model learns that zip code is an important discriminatory feature, there’s a good chance that it has learned a subtle proxy for racial discrimination.
Layers in a NN must always be fully connected? (True/False)
False. Other connectivity structures are possible, and in many cases (like images) desirable.
Why does it make sense to consider small patches of inputs when building a NN for image data? What are these small patches called?
They are called receptive fields, modeled after similar structure in the human visual cortex. They make sense to use because while structure exists in image data, it’s often localized, such as edges and lines, and collections of those lines and edges forming higher level motifs.
Why does using linear layers not make sense for some applications?
Consider the case of image data. If we connect each pixel to every weight in a hidden linear layer, there could be hundreds of millions of parameters to learn for just one layer. Furthermore, patterns in images tend to be SPATIALLY LOCAL. A pixel in the upper right corner in all likelihood will have very little to do with a pixel in the lower left.
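The parameter blow-up is easy to check with back-of-the-envelope arithmetic (the image and layer sizes below are illustrative):

```python
# Parameter count for ONE fully connected layer on a 500x500 RGB image
# feeding a hidden layer of 1000 units.
pixels = 500 * 500 * 3            # 750,000 input values
hidden = 1000
params = pixels * hidden + hidden  # weights + biases
# params == 750_001_000 — hundreds of millions for a single layer
```

Convolutional layers sidestep this by sharing a small set of weights across spatial locations, which also matches the spatially local structure of images.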
As the number of parameters to learn in a model increases, more data is needed to ensure a robust model that generalizes to new data? (True/False)
True.