[MLS] Exploratory Data Analysis Flashcards
(33 cards)
What is the pandas Python library used for?
Manipulating and analysing tabular data held in DataFrames (2-D tables) or Series (1-D columns)
What is the matplotlib Python library used for?
For visualising data
What is the seaborn Python library used for?
A higher-level data visualisation library built on top of matplotlib; adds chart types such as heat maps and joint plots
What is the scikit-learn Python library used for?
For the ML modelling itself
What are the 3 major types of data?
Categorical, numerical and ordinal
What is the difference between categorical and ordinal data?
Ordinal data has an inherent order, e.g. 1 star is worse than 4 stars, although the gaps between values are not necessarily equal. Categorical data has no inherent order or numerical meaning, e.g. gender, race
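A minimal sketch of how that difference tends to show up when encoding features. The star scale and the colour list are hypothetical examples, and the helper names are made up for illustration.

```python
# Ordinal data: the labels have a meaningful order, so integer ranks preserve it.
STAR_ORDER = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]  # hypothetical scale

# Categorical data: no order, so each label gets its own 0/1 column instead.
COLOURS = ["red", "green", "blue"]  # hypothetical unordered categories


def encode_ordinal(value, order=STAR_ORDER):
    """Map an ordinal label to its rank (0 = lowest)."""
    return order.index(value)


def encode_one_hot(value, categories):
    """Map a categorical label to a one-hot vector, which implies no ordering."""
    return [1 if value == c else 0 for c in categories]


# Comparing ranks is meaningful for ordinal data.
print(encode_ordinal("1 star") < encode_ordinal("4 stars"))
# One-hot vectors can only be tested for equality, not ranked.
print(encode_one_hot("green", COLOURS))
```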
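What is a Bernoulli distribution?
A discrete probability distribution for a single trial with two possible outcomes (success with probability p, failure with probability 1 - p); a special case of the binomial distribution with n = 1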
What is a probability mass function for?
Discrete data: it gives the probability of each possible value. Plotted, it looks like a histogram (separate bars), NOT a continuous curve
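A small sketch of a probability mass function, using a single-trial (Bernoulli) variable as the example. The success probability of 0.3 is an assumed value chosen only for illustration.

```python
def bernoulli_pmf(k, p):
    """P(X = k) for one trial with success probability p; k is 0 or 1."""
    if k not in (0, 1):
        return 0.0  # no probability mass anywhere else
    return p if k == 1 else 1.0 - p


p = 0.3  # assumed success probability
pmf = {k: bernoulli_pmf(k, p) for k in (0, 1)}

# A PMF assigns mass to individual values, and the masses sum to 1.
total = sum(pmf.values())
for k, mass in pmf.items():
    print(k, mass)
```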
Are QuickSight dashboards read-only?
Yes
What would you use a scatter plot or heat map for?
To visualise a correlation
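The number behind such a plot is often the Pearson correlation coefficient. Below is a hand-rolled sketch using only the standard library; the `hours` and `scores` values are made-up sample data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]      # made-up sample data
scores = [2, 4, 6, 8, 10]    # perfectly linear in hours

r = pearson_r(hours, scores)
print(r)
```

A value near +1 or -1 would show up as a tight diagonal band on a scatter plot, and as a strongly coloured cell on a correlation heat map.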
What would you use a bar chart for?
Comparison and distribution
What is a geospatial chart?
A map with an overlay of information on it
What does a master node do in an EMR cluster?
Manages the cluster, assigns tasks
What are the differences between a core node and a task node in an EMR cluster?
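A core node runs tasks and also stores data in the Hadoop Distributed File System (HDFS), so it stays for the life of the cluster
A task node is ephemeral: it only runs tasks and does not store data in HDFS, so it can be added or removed without risking data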
When would you use a transient vs a long-running EMR cluster?
Use a transient cluster when you have a defined list of steps and want the cluster to shut down automatically once they are complete. Use a long-running cluster when you’re not sure exactly what you want to do with it yet, so it stays available whenever you need it
What is important to note about HDFS when closing down a cluster?
It is transient - the data on it will be lost if not backed up when the cluster is shut down
What happens if a node in EMR fails in terms of the data on that node?
Normally nothing - HDFS keeps multiple copies of each block of data on different nodes, so there is no data loss when a single node fails
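A toy simulation of why replication protects against a single node failure. It assumes a replication factor of 3 and a simple round-robin placement; real HDFS placement is more involved, and the node names are hypothetical.

```python
def place_blocks(n_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (round-robin toy model)."""
    placement = {}
    for block in range(n_blocks):
        placement[block] = {
            nodes[(block + i) % len(nodes)] for i in range(replication)
        }
    return placement


nodes = ["core-1", "core-2", "core-3", "core-4"]  # hypothetical node names
placement = place_blocks(8, nodes)

# Simulate one node failing and check which blocks still have a copy.
failed = "core-2"
surviving = {block: holders - {failed} for block, holders in placement.items()}
lost = [block for block, holders in surviving.items() if not holders]
print(lost)
```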
What is EMRFS?
Elastic MapReduce File System - lets the cluster use S3 as if it were HDFS. Still fast, but the data is no longer transient: it persists after the cluster shuts down
How does EMR charging work?
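Rates are quoted per instance-hour, on top of the underlying EC2 charges; usage is billed per second with a one-minute minimum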
Can you resize the core nodes running in an EMR cluster on the fly?
Yes
What is Spark Streaming?
Streaming analytics that works by breaking the stream into mini-batches, kind of like continuously appending rows to a table
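A plain-Python sketch of the mini-batch idea, not Spark code. It batches by record count for simplicity, whereas Spark Streaming batches by time interval; `micro_batches` is a hypothetical helper.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to `batch_size` records from a stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


# Each mini-batch is appended to a growing "table" of rows.
table = []
for batch in micro_batches(range(7), batch_size=3):
    table.extend(batch)
    print(batch)
```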
What are EMR notebooks? Where are they backed up to and what do they allow for?
Notebooks for Spark that are a more visual way to interact with the service. These notebooks are backed up to S3. Allows for collaboration.
Which nodes in a Spark cluster are most likely to need the most processing power?
Core and task nodes. The master node mostly coordinates the work, so it doesn’t need to be as beefy.
What is the curse of dimensionality?
Too many features can be a problem, as they lead to a huge solution space within which the data is sparsely distributed. E.g. with 10,000 features and 10,000 rows, if only 1 row has the entry ‘2’ for one of those features, that value is represented by a single data point, so the data set is very, very sparse for that feature and there is little to learn from it.
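A rough way to see the sparsity: split each feature into a fixed number of bins and ask what fraction of the resulting cells the rows could possibly occupy. The choice of 10 bins per feature is an assumption made for illustration.

```python
def max_occupied_fraction(n_rows, n_features, bins_per_feature=10):
    """Upper bound on the fraction of grid cells that n_rows points can fill."""
    n_cells = bins_per_feature ** n_features
    return min(1.0, n_rows / n_cells)


# With 10,000 rows, coverage collapses quickly as features are added.
for n_features in (1, 2, 4, 6, 8):
    print(n_features, max_occupied_fraction(10_000, n_features))
```

Even in the best case, 10,000 rows cannot cover more than 1% of the cells once there are 6 features, and the bound keeps shrinking tenfold with each extra feature.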