[MLS] Exploratory Data Analysis Flashcards

(33 cards)

1
Q

What is the pandas Python library used for?

A

Used to manipulate and analyse your data by placing it in DataFrames or Series
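A minimal sketch of what that looks like (the data here is made up):

```python
import pandas as pd

# Build a small DataFrame (hypothetical example data)
df = pd.DataFrame({
    "city": ["Leeds", "York", "Leeds"],
    "sales": [100, 250, 175],
})

# A single column is a Series
sales = df["sales"]

# Typical manipulations: filtering and aggregation
leeds = df[df["city"] == "Leeds"]
total_by_city = df.groupby("city")["sales"].sum()
```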

2
Q

What is the matplotlib Python library used for?

A

For visualising data
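For example, a basic line chart might look like this (made-up data; the `Agg` backend is used so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders without a screen
import matplotlib.pyplot as plt

# Plot a simple line chart of hypothetical weekly sales
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o")
ax.set_xlabel("week")
ax.set_ylabel("sales")
ax.set_title("Weekly sales")
fig.savefig("sales.png")  # write the chart out as an image file
```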

3
Q

What is the seaborn Python library used for?

A

A higher-level visualisation library built on top of matplotlib, adding plot types such as heat maps and joint plots

4
Q

What is the scikit-learn Python library used for?

A

For the ML modelling itself: training, predicting with, and evaluating models
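A tiny sketch of the typical fit/predict workflow (the data and model choice here are made up for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Toy dataset: one feature, binary label (hypothetical data)
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# Fit a classifier, then predict on unseen values
model = LogisticRegression()
model.fit(X, y)
preds = model.predict([[0.5], [2.5]])
```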

5
Q

What are the 3 major types of data?

A

Categorical, numerical and ordinal

6
Q

What is the difference between categorical and ordinal data?

A

Ordinal data has an inherent order, e.g. 1 star is worse than 4 stars, whereas categorical data has no inherent order, e.g. gender, race

7
Q

What is a Bernoulli distribution?

A

The discrete probability distribution of a single trial with two outcomes: 1 (success) with probability p, 0 (failure) with probability 1 − p
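In code, the Bernoulli pmf is just a two-case function (a sketch, not a library implementation):

```python
# pmf of a Bernoulli(p) variable: P(X=1) = p, P(X=0) = 1 - p,
# and zero for any other outcome
def bernoulli_pmf(k: int, p: float) -> float:
    if k == 1:
        return p
    if k == 0:
        return 1.0 - p
    return 0.0
```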

8
Q

What is a probability mass function for?

A

Describing discrete data: it gives the probability of each exact value, so it looks like a histogram, NOT a curve (unlike a probability density function)
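An empirical pmf can be estimated from discrete observations by counting (the rolls below are made up):

```python
from collections import Counter

# Estimate a pmf from discrete observations (hypothetical die rolls)
rolls = [1, 2, 2, 3, 3, 3, 6]
counts = Counter(rolls)

# Probability of each observed value; the probabilities sum to 1
pmf = {value: count / len(rolls) for value, count in counts.items()}
```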

9
Q

Are QuickSight dashboards read-only?

A

Yes

10
Q

What would you use a scatter plot or heat map for?

A

To visualise a correlation

11
Q

What would you use a bar chart for?

A

Comparison and distribution

12
Q

What is a geospatial chart?

A

A map with an overlay of information on it

13
Q

What does a master node do in an EMR cluster?

A

Manages the cluster, assigns tasks

14
Q

What are the differences between a core node and a task node in an EMR cluster?

A

Core nodes last for the life of the cluster and host the Hadoop Distributed File System (HDFS) to share data
Task nodes are ephemeral and do not host HDFS

15
Q

When would you use a transient vs a long-running EMR cluster?

A

Use a transient cluster to shut the cluster down automatically once the list of tasks you have specified is complete; use a long-running cluster when you are not sure exactly what you want to do with it, so you have easy access whenever you want it

16
Q

What is important to note about HDFS when closing down a cluster?

A

It is transient - the data on it will be lost if not backed up when the cluster is shut down

17
Q

What happens if a node in EMR fails in terms of the data on that node?

A

Nothing - multiple copies of the data are propagated across the nodes to make sure there is no data loss when one fails

18
Q

What is EMRFS?

A

Elastic MapReduce File System - uses S3 as if it were HDFS. Still fast, but no longer transient

19
Q

How does EMR charging work?

A

You are charged per second (with a one-minute minimum) for each instance in the cluster, on top of the underlying EC2 instance costs

20
Q

Can you resize the core nodes running in an EMR cluster on the fly?

A

Yes - core nodes can be added or removed on the fly, although removing them is slower because HDFS data must first be moved off them; task nodes can be added or removed freely

21
Q

What is Spark Streaming?

A

Streaming analytics that works by breaking the stream into mini-batches, kind of like continuously appending rows to a table

22
Q

What are EMR notebooks? Where are they backed up to and what do they allow for?

A

Notebooks for Spark that are a more visual way to interact with the service. These notebooks are backed up to S3. Allows for collaboration.

23
Q

Which nodes in a Spark cluster are most likely to need the most processing power?

A

Core and task nodes; the master node just coordinates the cluster and assigns tasks, so it doesn't need to be as beefy.

24
Q

What is the curse of dimensionality?

A

Too many features can be a problem as they lead to a huge solution space within which the data is sparsely distributed (e.g. if you have 10,000 features and 10,000 rows but only 1 has the entry ‘2’ for one of those features, it will be a very very sparse data set for that one feature with the one entry of ‘2’.)
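A back-of-the-envelope way to see this: coarsely bin each feature and count how much of the resulting space your samples can possibly occupy (the bin count and sample sizes here are illustrative assumptions):

```python
# With a fixed number of samples, the fraction of the feature space that
# is occupied collapses as features are added. Each feature is coarsely
# split into 10 bins, so d features give 10**d cells.
def occupancy(n_samples: int, n_features: int, bins_per_feature: int = 10) -> float:
    cells = bins_per_feature ** n_features
    return min(n_samples, cells) / cells

low = occupancy(10_000, 2)    # 10,000 samples over 100 cells: fully covered
high = occupancy(10_000, 10)  # 10,000 samples over 10 billion cells: almost empty
```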

25
Q

What are 2 algorithms that can be used to reduce the dimensionality of the training set?

A

Principal component analysis and K-means
26
Q

What type of data works well with deep learning for training set imputation?

A

Categorical
27
Q

What is MICE?

A

Multiple Imputation by Chained Equations, a regression technique used for imputing missing values in the training set
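scikit-learn's `IterativeImputer` implements a similar chained-regression approach (it is marked experimental, hence the enabling import; the data below is made up):

```python
import numpy as np

# IterativeImputer is experimental; importing this flag enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Two correlated features (y = 2x) with one missing value
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])

# Each feature with missing values is modelled as a regression
# on the other features, iteratively
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
```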
28
Q

What is SMOTE?

A

Synthetic Minority Over-Sampling Technique, a technique to artificially balance a dataset by creating new samples of the minority class using nearest neighbours
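The core idea can be sketched in a few lines: pick a minority sample, find its nearest minority-class neighbour, and interpolate a new point between them (this is a simplified sketch with made-up points, not the full algorithm):

```python
import random

# Nearest neighbour by squared Euclidean distance
def nearest_neighbour(point, others):
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

# New synthetic point at a random position on the line between two points
def synthetic_sample(point, neighbour, rng):
    t = rng.random()
    return tuple(a + t * (b - a) for a, b in zip(point, neighbour))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
rng = random.Random(0)
base = minority[0]
nn = nearest_neighbour(base, minority[1:])
new_point = synthetic_sample(base, nn, rng)
```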
29
Q

What is binning?

A

Bucketing together observations based on a range of values. Transforms numerical data into ordinal data.
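With pandas this is a one-liner using `pd.cut` (ages and bin edges below are made up):

```python
import pandas as pd

# Bin numeric ages into ordered ranges (hypothetical data)
ages = pd.Series([5, 17, 25, 40, 70])
bins = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],
    labels=["child", "young adult", "adult", "senior"],
)
```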
30
Q

What is transforming a feature?

A

Applying a function to a feature to make it better suited to training. Not necessarily replacing data, but making new features.
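A common example is log-transforming a heavily skewed feature to compress its range (the income figures are made up):

```python
import math

# Log-transform a skewed feature; the original values are kept
# and a new feature is derived alongside them
incomes = [20_000, 45_000, 1_000_000]
log_incomes = [math.log10(x) for x in incomes]
```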
31
Q

What is one-hot encoding?

A

Creating a bucket for every category, and then giving each entry a 1 or 0 to signify whether or not it is in that category.
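pandas does this with `get_dummies` (the colour column is made-up data):

```python
import pandas as pd

# One-hot encode a categorical column: one indicator column per category
df = pd.DataFrame({"colour": ["red", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["colour"])
```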
32
Q

Why can shuffling the training data be useful?

A

It can get rid of any residual patterns resulting from the order in which the data was collected
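One detail worth remembering: features and labels must be shuffled together so the pairs stay aligned (a small sketch with made-up data):

```python
import random

# Shuffle X and y together so each feature row keeps its label
def shuffle_together(X, y, seed=0):
    pairs = list(zip(X, y))
    random.Random(seed).shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)

X = [[1], [2], [3], [4]]
y = ["a", "b", "c", "d"]
X_shuf, y_shuf = shuffle_together(X, y)
```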
33
Q

What is SageMaker Ground Truth?

A

A service that uses humans to label data at the start, then trains a model on their labels so the model progressively takes over more and more of the labelling. The humans can be internal teams or Mechanical Turk workers.