Analytics Flashcards
What is AWS Glue?
It discovers the underlying schema of your data (using crawlers).
It also runs custom ETL jobs.
What is stored in the Glue Data Catalog?
Your table definitions or schemas. All the original data is still in S3.
What does Glue really allow you to do?
Query your unstructured data in S3 like it is structured data.
What is Hive?
It runs on EMR and lets you run SQL-like queries (HiveQL).
Can a Hive metastore be used in Glue?
Yes
Can a Glue Data Catalog be used in Hive?
Yes
What does enabling Job Metrics do in AWS Glue?
It helps you understand the maximum number of DPUs that your Glue job needs.
Where can you plot maximum needed executors versus maximum allocated executors from the Glue job metrics?
In the Glue console. You do not need CloudWatch for this.
What is a dynamic frame in AWS Glue?
A collection of dynamic records (DynamicRecord objects), each of which is self-describing.
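To make "self-describing" concrete, here is a minimal plain-Python sketch. It does not use the awsglue DynamicFrame class; the `infer_schema` helper and the sample records are hypothetical. It shows how a schema can be worked out from the records themselves, including a field that turns up with more than one type.

```python
def infer_schema(records):
    """Collect, per field, the set of Python type names seen across records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical records: no schema is declared up front, and "price"
# arrives as a float in one record and as a string in another.
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": "12.50", "currency": "USD"},
]

schema = infer_schema(records)
print(schema)
```

A field with more than one type, like price here, is the kind of ambiguity that ResolveChoice is meant to handle.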
How can I remove outliers in my data in AWS Glue ETL?
Use the Filter transformation.
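A filter is just a predicate that returns True for records to keep. The sketch below is plain Python so it runs anywhere; the field name, the bounds, and the sample records are assumptions for illustration. In a Glue script the same kind of predicate would be passed as the `f` argument of `Filter.apply(frame=..., f=...)`.

```python
def is_not_outlier(record, field="price", low=0.0, high=1000.0):
    """Return True when the field value lies inside the accepted range."""
    return low <= record[field] <= high


# Hypothetical sample data with two obvious outliers.
records = [
    {"id": 1, "price": 19.99},
    {"id": 2, "price": -5.0},
    {"id": 3, "price": 250000.0},
    {"id": 4, "price": 42.0},
]

# Keep only the records for which the predicate holds.
kept = [r for r in records if is_not_outlier(r)]
print(kept)
```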
Can you join data in AWS Glue ETL?
Yes
How do you find matches or duplicates in your data in AWS Glue when there is no common unique identifier?
Use the FindMatches ML transform.
Can you convert formats in Glue?
Yes
What does ResolveChoice do in AWS Glue?
It deals with ambiguities in your data, e.g., two columns named price with different types.
What options does ResolveChoice in AWS Glue have?
make_cols: Creates a separate column for each type (e.g., price_double, price_string).
cast: Casts all values to a specified type.
make_struct: Creates a struct that contains each data type.
project: Projects every value to one given type.
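The four options are easier to remember when you see what each does to the same ambiguous column. The following is a plain-Python emulation, not the awsglue implementation; the `resolve_choice` helper, the type-name mapping, and the sample rows are all assumptions made for illustration.

```python
# Map Python types to Glue-style type names (illustrative subset only).
GLUE_NAMES = {int: "int", float: "double", str: "string"}
CASTERS = {"int": int, "double": float, "string": str}


def _convert(value, type_name):
    """Convert value to the named type, or None if it cannot be converted."""
    try:
        return CASTERS[type_name](value)
    except (TypeError, ValueError):
        return None


def resolve_choice(records, column, action):
    """Return new records with `column` resolved according to `action`."""
    kind, _, target = action.partition(":")
    # Every type observed for the column across all records.
    types_seen = sorted({GLUE_NAMES[type(r[column])] for r in records})
    resolved = []
    for record in records:
        value = record[column]
        own_type = GLUE_NAMES[type(value)]
        row = {k: v for k, v in record.items() if k != column}
        if kind == "make_cols":
            # One column per observed type; unused ones stay empty.
            for t in types_seen:
                row[f"{column}_{t}"] = value if t == own_type else None
        elif kind == "make_struct":
            # One struct column with a field per observed type.
            row[column] = {t: (value if t == own_type else None) for t in types_seen}
        elif kind in ("cast", "project"):
            # Both leave a single type; project is meant to name one of
            # the types already present in the column.
            row[column] = _convert(value, target)
        else:
            raise ValueError(f"unknown action: {action}")
        resolved.append(row)
    return resolved


rows = [{"id": 1, "price": 9.99}, {"id": 2, "price": "12.50"}]
for action in ("make_cols", "make_struct", "cast:double", "project:string"):
    print(action, resolve_choice(rows, "price", action))
```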
How do you modify your Glue Data Catalog after adding a new partition to your data?
Set the enableUpdateCatalog and partitionKeys options in your ETL script.
How do you modify your Glue Data Catalog after adding a new schema or table to your data?
Set enableUpdateCatalog together with updateBehavior.
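As a rough sketch of how the two catalog-update cards fit together, the helper below builds the options dictionary that a Glue ETL script would pass as `additional_options` when writing a DynamicFrame to a catalog table. The helper name and the partition columns are hypothetical; the option keys come from the cards above, and `UPDATE_IN_DATABASE` is assumed here as the updateBehavior value.

```python
def catalog_update_options(partition_keys=None, update_schema=False):
    """Build additional_options for a catalog-updating write (sketch only)."""
    options = {"enableUpdateCatalog": True}
    if partition_keys:
        # New partitions are registered under these partition columns.
        options["partitionKeys"] = list(partition_keys)
    if update_schema:
        # Assumed value: ask Glue to apply schema changes to the catalog.
        options["updateBehavior"] = "UPDATE_IN_DATABASE"
    return options


# Hypothetical partition columns for a table partitioned by date.
new_partition_opts = catalog_update_options(partition_keys=["year", "month"])
new_schema_opts = catalog_update_options(update_schema=True)
print(new_partition_opts)
print(new_schema_opts)
```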
How are you billed in AWS Glue?
By the second.
How are you billed for development endpoints in AWS Glue?
By the minute.
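The difference between per-second and per-minute billing is just rounding. The sketch below assumes cost is DPUs times billed hours times a DPU-hour rate; the rate of 0.44 is a placeholder, not a quoted price, and any minimum billing duration is ignored.

```python
import math


def billed_seconds(runtime_seconds, granularity_seconds=1):
    """Round the runtime up to the next billing increment."""
    return math.ceil(runtime_seconds / granularity_seconds) * granularity_seconds


def glue_cost(dpus, runtime_seconds, rate_per_dpu_hour, granularity_seconds=1):
    """Cost = DPUs x billed hours x rate (rate is an assumed placeholder)."""
    seconds = billed_seconds(runtime_seconds, granularity_seconds)
    return dpus * seconds / 3600 * rate_per_dpu_hour


RATE = 0.44  # placeholder DPU-hour rate, for illustration only

# A 90-second ETL job on 10 DPUs, billed by the second.
job_cost = glue_cost(10, 90, RATE, granularity_seconds=1)

# A 90-second development endpoint session on 2 DPUs, billed by the minute.
endpoint_cost = glue_cost(2, 90, RATE, granularity_seconds=60)

print(round(job_cost, 4), round(endpoint_cost, 4))
```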
When you want to use Hive or Pig, what ETL engine should you use as a matter of best practice?
EMR
Can Glue ingest streaming data?
Yes, from Kinesis or Kafka.
What is Glue Data Quality?
Rules that check the quality of your data. If a rule's threshold is violated, the job can stop or a CloudWatch alarm can be triggered.
What language does Glue Data Quality support?
DQDL (Data Quality Definition Language).
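For a feel of what a DQDL ruleset looks like, the sketch below assembles one as a Python string. The `build_ruleset` helper and the column names are hypothetical, and the three rules are only examples of the general shape; check the rule types against the DQDL reference before relying on them.

```python
def build_ruleset(rules):
    """Join individual rule strings into a DQDL-style ruleset block."""
    body = ",\n    ".join(rules)
    return f"Rules = [\n    {body}\n]"


# Hypothetical columns; the second rule shows a threshold (95% non-null).
ruleset = build_ruleset([
    'IsComplete "order_id"',
    'Completeness "email" > 0.95',
    'ColumnValues "price" > 0',
])
print(ruleset)
```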
What are recipes in Glue DataBrew?
They are transformations that can be saved and applied to other jobs.