[MLS] Exploratory Data Analysis Flashcards by Al Them

What is the pandas python library used for?

For loading, cleaning and manipulating tabular data, which it holds in DataFrame (table) and Series (single column) objects
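
A minimal sketch of what that looks like in practice. The column names and values are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Leeds", "York", "Leeds", "Hull"],
    "sales": [120, 80, 100, 60],
})

sales = df["sales"]                            # a single column is a Series
big = df[df["sales"] > 90]                     # filter rows with a boolean mask
per_city = df.groupby("city")["sales"].sum()   # aggregate per category

print(type(sales).__name__)
print(per_city)
```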

What is the matplotlib python library used for?

For visualising data

What is the seaborn python library used for?

A higher-level statistical plotting library built on top of matplotlib (for visualising data), with extra chart types such as heat maps and joint plots

What is the scikit-learn python library used for?

For the ML modelling itself

What are the 3 major types of data?

Categorical, numerical and ordinal

What is the difference between categorical and ordinal data?

Ordinal data is categorical data whose categories have a meaningful order, e.g. 1 star is worse than 4 stars, whereas plain categorical data has no inherent order or numerical meaning, e.g. gender, race
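
One way to see the difference, sketched with pandas. The star ratings follow the example above; the colour labels are invented for illustration.

```python
import pandas as pd

# Ordinal: the categories have a meaningful order.
ratings = pd.Series(pd.Categorical(
    ["1 star", "4 stars", "2 stars"],
    categories=["1 star", "2 stars", "3 stars", "4 stars"],
    ordered=True,
))
print(ratings < "4 stars")   # order comparisons are allowed
print(ratings.max())

# Plain categorical: labels only, no order (hypothetical colour labels).
colours = pd.Series(pd.Categorical(["red", "blue", "red"]))
print(colours.cat.ordered)
```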

What is a bernoulli distribution?

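A discrete probability distribution for a single trial with only two possible outcomes: success (1) with probability p and failure (0) with probability 1 - p. It is the special case of the binomial distribution with 1 trial.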
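
A small sketch of a single trial, using only the standard library. The function names and the value p = 0.3 are chosen here for illustration.

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) for a single trial with success probability p."""
    if k == 1:
        return p
    if k == 0:
        return 1 - p
    return 0.0

def bernoulli_sample(p, rng):
    """Run one trial: 1 (success) with probability p, else 0 (failure)."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)  # seeded so the run is repeatable
draws = [bernoulli_sample(0.3, rng) for _ in range(10_000)]
print(sum(draws) / len(draws))  # should land close to p = 0.3
```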

What is a probability mass function for?

For discrete data: it gives the probability of each possible value. Plotted, it looks like a histogram (separate bars), NOT a continuous curve
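
As a rough illustration, an empirical probability mass function can be estimated by counting how often each discrete value occurs. The observations below are invented.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observations of a discrete variable (invented star ratings).
ratings = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

counts = Counter(ratings)
pmf = {value: Fraction(count, len(ratings))
       for value, count in sorted(counts.items())}

# Text "histogram": one separate bar per discrete value.
for value, prob in pmf.items():
    print(value, "#" * counts[value], float(prob))
```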

Are QuickSight dashboards read-only?

Yes

What would you use a scatter plot or heat map for?

To visualise a correlation

What would you use a bar chart for?

Comparison and distribution

What is a geospatial chart?

A map with an overlay of information on it

What does a master node do in an EMR cluster?

Manages the cluster, assigns tasks

What are the differences between a core node and a task node in an EMR cluster?

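A core node runs tasks and also stores data in the Hadoop Distributed File System (HDFS), so it should stay up for the life of the cluster
A task node only runs tasks and does not store data in HDFS, so it is ephemeral and can be added or removed without risking data loss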

When would you use a transient vs a long-running EMR cluster?

Use a transient cluster when you have a defined list of tasks: the cluster shuts down once they are complete. Use a long-running cluster when you’re not sure exactly what you want to do with it yet, so that you have easy access whenever you want it.

What is important to note about HDFS when closing down a cluster?

It is ephemeral: the data on it will be lost when the cluster is shut down unless it has been backed up elsewhere first

What happens if a node in EMR fails in terms of the data on that node?

Normally nothing: HDFS replicates each block of data across several nodes, so there is no data loss when a single node fails

What is EMRFS?

Elastic MapReduce File System: lets the cluster use S3 as if it were HDFS. Still fast, but the data is no longer transient, as it persists in S3 after the cluster shuts down

How does EMR charging work?

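At a per-instance hourly rate, on top of the underlying EC2 charges. Note that AWS now meters the usage per second, with a one-minute minimum.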

Can you resize the core nodes running in an EMR cluster on the fly?

Yes

What is Spark Streaming?

Streaming analytics that works by breaking the stream into micro-batches and processing each one as it arrives, a bit like continuously appending rows to a table
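
The batching idea can be sketched in plain Python. This is not Spark's API: the `micro_batches` helper is hypothetical, and it cuts batches by record count, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

table = []  # stands in for the ever-growing result table
for batch in micro_batches(range(7), batch_size=3):
    table.extend(batch)  # each batch appends new rows
    print(batch, "->", len(table), "rows so far")
```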

What are EMR notebooks? Where are they backed up to and what do they allow for?

Notebooks for Spark that offer a more visual, interactive way to work with the cluster. They are backed up to S3 and allow for collaboration.

Which nodes in a Spark cluster are most likely to need the most processing power?

Core and Task nodes, as they do the actual processing. The Master node is just coordinating and routing work around, so it doesn’t need to be as beefy.

What is the curse of dimensionality?

Too many features can be a problem, as they lead to a huge solution space within which the data is sparsely distributed. For example, with 10,000 features and 10,000 rows, if only 1 row has the entry ‘2’ for one of those features, the data is very sparse for that feature: there is a single example of ‘2’ to learn from.
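
A back-of-the-envelope sketch of how quickly the space empties out. It assumes, purely for illustration, that every feature is split into 10 ranges, so the space has 10 to the power of n_features cells and each row can occupy at most one of them.

```python
def max_occupied_fraction(rows, n_features, bins_per_feature=10):
    """Upper bound on the share of cells that can contain any data at all."""
    cells = bins_per_feature ** n_features
    # Each row falls in exactly one cell, so at most `rows` cells are occupied.
    return min(rows, cells) / cells

rows = 10_000
for n_features in (1, 2, 4, 8):
    print(n_features, max_occupied_fraction(rows, n_features))
```

With 10,000 rows the space can still be fully covered at 4 features, but by 8 features at most 0.01% of the cells can hold even one row.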

What are 2 algorithms that can be used to reduce the dimensionality of the training set?

Principal component analysis and K-means
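
A minimal sketch of the PCA half of that answer, using scikit-learn. The data is synthetic and invented for illustration: 5 observed features that are really driven by 2 hidden factors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 features built from 2 hidden factors plus noise.
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = hidden @ mixing + 0.01 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # project 5 features down to 2 components

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # share of variance kept
```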

What type of data works well with deep learning for training set imputation?

Categorical

What is MICE?

Multiple imputation by chained equations, a regression technique used for imputations in the training set
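
A hedged sketch using scikit-learn's `IterativeImputer`, which its documentation describes as experimental and inspired by MICE. Note that it returns a single filled-in dataset, not multiple imputations. The data is invented for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first; np.nan marks gaps.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.0],
    [np.nan, 10.1],
    [6.0, 12.0],
])

# Each feature with gaps is regressed on the other features, in rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
```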

What is SMOTE?

Synthetic Minority Over-Sampling Technique, a technique to artificially balance a dataset by creating new samples of the minority class using nearest neighbours
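
The core idea can be sketched with the standard library alone. This is a simplified illustration, not a library implementation: the `smote_sketch` helper and the sample points are hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```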

What is binning?

Bucketing together observations based on a range of values. Transforms numerical data into ordinal data.
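
A small sketch of binning ages into ordered buckets. The bin edges and labels are assumptions chosen for illustration.

```python
from bisect import bisect_left

# Hypothetical bin edges and labels, chosen only for illustration.
edges = [18, 35, 65]  # inclusive upper bounds of the first three bins
labels = ["child", "young adult", "adult", "senior"]

def bin_age(age):
    """Map a numerical age to an ordered bucket label."""
    return labels[bisect_left(edges, age)]

ages = [5, 18, 19, 40, 70]
print([bin_age(a) for a in ages])
```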

What is transforming a feature?

Applying a function to a feature to make it better suited to training. Not necessarily replacing data, but making new features.
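
For instance, a log transform can compress a heavily skewed feature. The income values below are invented, and the original column is kept alongside the new one.

```python
import math

# Hypothetical skewed feature (incomes), invented for illustration.
income = [20_000, 35_000, 50_000, 1_000_000]

# Keep the original feature and add a transformed one next to it.
log_income = [math.log10(x) for x in income]
print(log_income)
```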

What is one-hot encoding?

Creating a separate binary column for every category, and then giving each entry a 1 or 0 in each column to signify whether it is in that category or not.
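
A minimal sketch in plain Python. The `one_hot` helper and the colour values are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows), where each row has a single 1."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, rows = one_hot(["red", "green", "red", "blue"])
print(categories)
for row in rows:
    print(row)
```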

Why can shuffling the training data be useful?

As it can get rid of any residual patterns resulting from the order in which the data was collected
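
A short sketch of shuffling while keeping every row attached to its own label. The data is invented: it was "collected" class by class, so the labels arrive sorted.

```python
import random

# Hypothetical data collected class by class, so the labels arrive sorted.
features = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
labels = [0, 0, 0, 1, 1, 1]

pairs = list(zip(features, labels))  # keep each row with its own label
random.Random(42).shuffle(pairs)     # seeded so the result is repeatable
shuffled_features, shuffled_labels = map(list, zip(*pairs))
print(shuffled_labels)
```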

What is SageMaker Ground Truth?

A service that uses humans to do the data labelling at the start, then trains a model on their labels so that it can take over progressively more of the labelling itself. The humans can be internal teams or workers from Mechanical Turk.

(33 cards)