Analytics Flashcards
What does Glue create when it scans your unstructured data in S3?
It creates metadata (table definitions and schemas) in the Glue Data Catalog, which can then be used to query the data.
What is Hive?
It allows you to run SQL-like queries (HiveQL) on EMR.
Can you import a Hive metastore into AWS Glue?
Yes. You can also import AWS Glue metadata into Hive.
How can you increase the performance of a Spark Job?
Provision additional DPUs (data processing units).
How can you determine how many DPUs you will need for your job?
Enable job metrics to understand the maximum capacity in DPUs that you will need.
Where are Glue errors reported?
CloudWatch
How can you schedule Glue jobs?
Glue Scheduler. This is the most straightforward approach.
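As a rough illustration, a scheduled Glue trigger is defined by a cron expression plus the job to start. The sketch below only builds the parameter dictionary that could be passed to boto3's Glue `create_trigger` call; the trigger name, job name, and schedule are hypothetical, and nothing here contacts AWS.

```python
def scheduled_trigger_params(trigger_name, job_name, cron_expression):
    """Build keyword arguments for a scheduled Glue trigger (no AWS call is made)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue schedules use a six-field cron expression wrapped in cron(...).
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

# Hypothetical names; "0 2 * * ? *" means 02:00 UTC every day.
trigger_params = scheduled_trigger_params(
    "nightly-etl-trigger", "nightly-etl-job", "0 2 * * ? *"
)
print(trigger_params["Schedule"])
```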
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords.
What is a DynamicRecord in AWS Glue?
They are records that are self-describing and have a schema.
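To make "self-describing" concrete, here is a plain-Python sketch (not the Glue API) in which each record carries its own fields and a schema is inferred from the collection afterwards. The sample records and the `infer_schema` helper are made up for illustration.

```python
# Each record describes its own fields; no schema is declared up front.
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": "12.50", "note": "imported"},  # price arrives as a string here
]

def infer_schema(rows):
    """Collect, per field, the set of Python type names seen across all rows."""
    schema = {}
    for row in rows:
        for field, value in row.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

schema = infer_schema(records)
# A field seen with more than one type is analogous to a Glue "choice" type.
ambiguous = sorted(field for field, types in schema.items() if len(types) > 1)
print(ambiguous)  # ['price']
```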
Using native AWS Glue functionality, how can you drop fields or null fields?
Use the DropFields or DropNullFields transformation.
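A plain-Python sketch of what these two transformations do, using lists of dicts in place of a DynamicFrame. It assumes DropNullFields removes only fields that are null in every record of the dataset, which is how the Glue documentation describes it. The helper names and sample data are hypothetical.

```python
rows = [
    {"id": 1, "name": "Ana", "fax": None, "ssn": "123"},
    {"id": 2, "name": None, "fax": None, "ssn": "456"},
]

def drop_fields(rows, paths):
    """Return new rows without the named fields (like DropFields)."""
    return [{k: v for k, v in row.items() if k not in paths} for row in rows]

def drop_null_fields(rows):
    """Drop only the fields that are null in every record (like DropNullFields)."""
    all_fields = {k for row in rows for k in row}
    always_null = {k for k in all_fields if all(row.get(k) is None for row in rows)}
    return drop_fields(rows, always_null)

without_ssn = drop_fields(rows, {"ssn"})
without_nulls = drop_null_fields(rows)  # "fax" goes; "name" stays (not null everywhere)
```

In an actual Glue script the equivalent calls would be along the lines of `DropFields.apply(frame=dyf, paths=["ssn"])` and `DropNullFields.apply(frame=dyf)`, where `dyf` is a DynamicFrame.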
Using AWS Glue, how can you select a subset of records during your ETL process?
Use the Filter transformation.
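The idea is a predicate applied to every record, keeping only those for which it returns true. A minimal plain-Python sketch with made-up order data (in Glue, the predicate would be passed as `f` to `Filter.apply`):

```python
orders = [
    {"order_id": 1, "amount": 250},
    {"order_id": 2, "amount": 40},
    {"order_id": 3, "amount": 100},
]

def filter_records(rows, predicate):
    """Keep only the rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]

large_orders = filter_records(orders, lambda r: r["amount"] >= 100)
print([r["order_id"] for r in large_orders])  # [1, 3]
```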
How can you enrich your data from another source in AWS Glue?
Use the Join transformation.
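Enrichment through a join can be pictured as an equality join on a key shared by the two sources. The sketch below is plain Python, not the Glue API; the `join_records` helper, field names, and data are hypothetical.

```python
def join_records(left, right, left_key, right_key):
    """Inner equality join: pair each left row with every right row sharing the key."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]

orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},  # no matching customer, so it is dropped
]
customers = [{"cust_id": 10, "name": "Ana"}]

enriched = join_records(orders, customers, "customer_id", "cust_id")
```

In a Glue script the corresponding call would look roughly like `Join.apply(frame1, frame2, keys1, keys2)`, with both inputs being DynamicFrames.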
What does the Map transformation in AWS Glue do?
It allows you to add fields, delete fields, and perform external lookups.
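A Map step applies one function to every record and returns the transformed record. The plain-Python sketch below adds a field, deletes a field, and does a lookup; the in-memory `COUNTRY_NAMES` table stands in for an external source, and all names are hypothetical.

```python
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}  # stand-in for an external lookup

def enrich(record):
    """Add a looked-up field and remove an unwanted one, without mutating the input."""
    out = dict(record)
    out["country_name"] = COUNTRY_NAMES.get(out["country"], "unknown")
    out.pop("internal_note", None)
    return out

rows = [
    {"id": 1, "country": "US", "internal_note": "check"},
    {"id": 2, "country": "FR"},
]
mapped = [enrich(r) for r in rows]
```

In Glue, a function shaped like `enrich` would be passed as `f` to `Map.apply(frame=dyf, f=enrich)`.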
What does the FindMatches ML transformation do in AWS Glue?
It identifies duplicate or matching records in your dataset, even when the records do not have a common identifier.
What format conversions can AWS Glue support?
CSV, JSON, Avro, Parquet, ORC, XML
What does AWS Glue ResolveChoice do?
It resolves ambiguities in your DynamicFrame and returns a new one. An example is two fields with the same name but different types.
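One way to resolve such an ambiguity is to cast the field to a single type. The sketch below shows that idea in plain Python; the helper and data are hypothetical. In Glue, the comparable call would be roughly `ResolveChoice.apply(frame=dyf, specs=[("price", "cast:double")])`, and other actions such as `make_cols`, `make_struct`, and `project` also exist.

```python
def resolve_choice_cast(rows, field, cast):
    """Return new rows with `field` cast to one type; the input rows are unchanged."""
    return [
        {**row, field: cast(row[field])} if field in row else dict(row)
        for row in rows
    ]

rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": "12.50"},  # same field name, different type
]
resolved = resolve_choice_cast(rows, "price", float)
print(resolved[1]["price"])  # 12.5
```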
How do you update your Glue Data Catalog?
You can re-run the crawler, or have the ETL script update the catalog by setting enableUpdateCatalog and updateBehavior.
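For the script route, these settings are typically passed as write options. The sketch below only builds that options dictionary; the helper name is hypothetical, and it assumes `updateBehavior` accepts `UPDATE_IN_DATABASE` or `LOG`, as the Glue documentation describes.

```python
def catalog_update_options(partition_keys=None, update_behavior="UPDATE_IN_DATABASE"):
    """Build the write options that ask a Glue job to update the Data Catalog."""
    if update_behavior not in ("UPDATE_IN_DATABASE", "LOG"):
        raise ValueError(f"unsupported updateBehavior: {update_behavior}")
    options = {"enableUpdateCatalog": True, "updateBehavior": update_behavior}
    if partition_keys:
        options["partitionKeys"] = list(partition_keys)
    return options

# Hypothetical partition columns.
options = catalog_update_options(partition_keys=["year", "month"])
```

In a Glue job, a dictionary like this would be supplied as `additional_options` when writing the DynamicFrame, for example through `glueContext.write_dynamic_frame_from_catalog(...)`.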
What are AWS Glue Development endpoints?
They allow you to use a notebook to develop your ETL script. They are launched in a VPC and can be used with a SageMaker notebook or Zeppelin.
What do AWS Glue job bookmarks do?
They keep track of where you left off so you do not reprocess old data. They work with S3 sources and relational databases. With a database, they only pick up new rows, not updated ones, and the primary key needs to be sequential.
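The database behavior can be pictured as a high-water mark on a sequential key. This plain-Python sketch is only a conceptual model, not how Glue stores bookmarks; it shows why new rows are picked up while updates to already-processed rows are missed.

```python
def process_new_rows(rows, bookmark, key="id"):
    """Return rows whose key is beyond the bookmark, plus the advanced bookmark."""
    new_rows = [r for r in rows if r[key] > bookmark]
    new_bookmark = max((r[key] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

# First run: nothing processed yet, so every row is new.
table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}, {"id": 3, "qty": 1}]
first_batch, bookmark = process_new_rows(table, bookmark=0)

# Between runs, row 2 is updated in place and row 4 is inserted.
table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 70}, {"id": 3, "qty": 1}, {"id": 4, "qty": 9}]
second_batch, bookmark = process_new_rows(table, bookmark)
# Only row 4 is returned; the update to row 2 sits below the bookmark and is skipped.
```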
Can you start a step function from an AWS Glue event?
Yes
How are you billed using AWS Glue?
You are billed by the second.
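Per-second billing means cost scales with DPUs times run time. A small arithmetic sketch follows; the rate of 0.44 USD per DPU-hour and the 60-second minimum are assumptions used for illustration only, so check current AWS pricing for your region and Glue version.

```python
def glue_job_cost(dpus, run_seconds, rate_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate job cost with per-second billing (rate and minimum are assumed values)."""
    billed_seconds = max(run_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# 10 DPUs for 15 minutes: 10 * 0.25 h * 0.44 USD, roughly 1.10 USD.
cost = glue_job_cost(dpus=10, run_seconds=15 * 60)
print(round(cost, 2))
```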
How are AWS Glue development endpoints billed?
By the minute.
If you want to use engines like Hive or Pig, what AWS service is the best fit?
EMR. Glue is based on Spark.
Can AWS Glue process streaming data?
Yes. It can do this from Kinesis or Kafka.