Data Transformation, Integrity, and Feature Engineering Flashcards
(97 cards)
What is AWS EMR?
Elastic MapReduce, a managed Hadoop framework.
Does AWS EMR have Notebooks?
Yes
Can AWS EMR support Spark?
Yes
What is the main node called in an EMR cluster?
The master node
What are core nodes used for in AWS EMR?
They store HDFS data and run tasks.
What is a Task Node in an EMR cluster?
It runs tasks but does not host data.
What is a good way to reduce costs for task nodes?
Use spot instances.
What is a transient cluster in AWS EMR?
A cluster that automatically terminates after all of its steps have completed.
How can you start jobs in EMR?
By connecting to the master node and submitting them directly, or by adding ordered steps through the console.
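The cluster cards above (master, core, and task nodes, spot pricing for task nodes, transient clusters, ordered steps) can be sketched as a single request body for boto3's `run_job_flow` call. This is only an illustration: the cluster name, release label, instance types and counts, and S3 paths are hypothetical, and the call itself is left commented out because it needs boto3 and AWS credentials.

```python
# Sketch of a request body for boto3's EMR run_job_flow call.
# Every name, path, instance type, and count here is a hypothetical placeholder.
job_flow_request = {
    "Name": "flashcard-transient-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            # Master node: coordinates the cluster.
            {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes: store HDFS data and run tasks.
            {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes: run tasks only, so spot instances are a cheap fit.
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False makes the cluster transient: it terminates after the last step.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    # Steps run in the order listed.
    "Steps": [
        {"Name": "clean-data", "ActionOnFailure": "TERMINATE_CLUSTER",
         "HadoopJarStep": {"Jar": "command-runner.jar",
                           "Args": ["spark-submit", "s3://example-bucket/clean.py"]}},
        {"Name": "build-features", "ActionOnFailure": "TERMINATE_CLUSTER",
         "HadoopJarStep": {"Jar": "command-runner.jar",
                           "Args": ["spark-submit", "s3://example-bucket/features.py"]}},
    ],
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# To submit (requires boto3 and AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**job_flow_request)
```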
What is the alternative storage to HDFS in AWS EMR?
S3
What is EMRFS?
A file system layer that lets the cluster use S3 as if it were HDFS.
What is the default size of a block in HDFS?
128 MB
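As a quick worked example of the block size, here is a small sketch that counts how many blocks a file would be split into, assuming the 128 MB default (taken here as 128 MiB). The helper name `hdfs_block_count` is made up for illustration.

```python
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # default HDFS block size, taken as 128 MiB


def hdfs_block_count(file_size_bytes, block_size_bytes=BLOCK_SIZE_BYTES):
    """Return how many blocks a file of the given size is split into."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: the last block may be only partly filled.
    return -(-file_size_bytes // block_size_bytes)


one_gib = 1024 ** 3
print(hdfs_block_count(one_gib))      # 8
print(hdfs_block_count(one_gib + 1))  # 9
```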
Is HDFS ephemeral?
Yes. The data is lost when the cluster terminates, but local storage is good for performance.
How can you track consistency in EMRFS?
With DynamoDB, which EMRFS uses to track metadata.
Can Spark replace MapReduce in AWS EMR?
Yes
What is Spark SQL in AWS EMR?
A low-latency query engine built on Spark, up to 100x faster than MapReduce. It lets you work with DataFrames.
What is GraphX for AWS EMR?
A graph processing framework built on top of Spark.
What is MLlib for AWS EMR?
Spark's machine learning library. It lets you run machine learning on top of Spark.
Can Spark be integrated with AWS Kinesis?
Yes
What is Zeppelin?
A notebook compatible with AWS EMR
What is the problem with too many features in your data?
It leads to sparse data (the curse of dimensionality).
What is a dimension?
Each feature is one dimension of the data.
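To illustrate how extra features make data sparse, here is a small sketch that one-hot encodes a single categorical column. One-hot encoding is used only as an example of a transformation that adds one dimension per distinct value; the helper names and the sample data are made up.

```python
def one_hot(values):
    """One-hot encode a list of categories: one new dimension per distinct value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def sparsity(rows):
    """Fraction of cells in the encoded rows that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(row.count(0) for row in rows)
    return zeros / total


colors = ["red", "green", "blue", "green", "yellow"]
categories, rows = one_hot(colors)
print(len(categories))  # 4 dimensions from one original column
print(sparsity(rows))   # 0.75
```

Each row has exactly one non-zero cell, so the share of zeros grows as the number of categories grows.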
What is the TF-IDF algorithm?
It figures out what terms are most relevant for a document.
What are the two components of the TF-IDF equation?
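Term Frequency and Inverse Document Frequency. The score is Term Frequency divided by Document Frequency, which is the same as Term Frequency multiplied by Inverse Document Frequency.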
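A minimal sketch of a TF-IDF score, assuming whitespace tokenization and the commonly used log-scaled variant (term frequency times the logarithm of total documents over document frequency) rather than a plain ratio. The function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = ["the cat sat", "the dog barked", "the cat purred"]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A term that appears in every document, such as "the" here, scores zero, while a term unique to one document scores highest within that document.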