Processing Flashcards
You are going to be working with objects arriving in S3. Once they arrive, you want to use AWS Lambda as part of an AWS Data Pipeline to process and transform the data. How can you easily configure Lambda to know when data has arrived in a bucket?
Configure S3 bucket notifications to Lambda (Lambda functions are generally invoked by some sort of trigger. S3 has the ability to trigger a Lambda function whenever a new object appears in a bucket.)
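A minimal sketch with boto3 of how such a notification could be wired up; the bucket and function names here are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

# The Lambda function must first grant S3 permission to invoke it.
lam.add_permission(
    FunctionName="process-incoming-data",        # hypothetical function name
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-landing-bucket",  # hypothetical bucket
)

# Then wire object-created events in the bucket to the function.
s3.put_bucket_notification_configuration(
    Bucket="my-landing-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-incoming-data",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```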
You are going to analyze the data coming in on an Amazon Kinesis stream, and you are going to use Lambda to process these records. What is a prerequisite when configuring Lambda to access Kinesis stream records?
The Kinesis stream should be in the same account (Lambda must be in the same account as the service triggering it, in addition to having an IAM policy granting it access.)
How can you make sure your Lambda functions have access to the other resources you are using in your big data architecture like S3, Redshift, etc.?
Using proper IAM roles (IAM roles define the access a Lambda function has to the services it communicates with.)
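A hedged example of what creating such a role might look like via boto3; the role name is hypothetical and the attached policy is just one possibility:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the Lambda service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="lambda-bigdata-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Grant only the access the function needs, e.g. read from S3.
iam.attach_role_policy(
    RoleName="lambda-bigdata-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
```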
You are creating a Lambda/Kinesis stream environment in which Lambda checks for records in the stream and processes them in a Lambda function. How does Lambda know there have been changes or updates to the Kinesis stream?
Lambda polls Kinesis streams (Although you think of a trigger as “pushing” events, Lambda actually polls your Kinesis streams for new activity.)
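A sketch of creating the event source mapping that turns on this polling, assuming hypothetical stream and function names:

```python
import boto3

lam = boto3.client("lambda")

# An event source mapping tells the Lambda service to poll the stream
# on your behalf and invoke the function with batches of records.
lam.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",  # hypothetical
    FunctionName="process-stream-records",  # hypothetical
    StartingPosition="LATEST",
    BatchSize=100,
)
```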
When using an Amazon Redshift database loader, how does Lambda keep track of files arriving in S3 to be processed and sent to Redshift?
In a DynamoDB table (The Amazon Redshift Database Loader reference implementation records each incoming file in a DynamoDB table, so every file is loaded into Redshift exactly once.)
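A minimal sketch of how this kind of file tracking can work; the table and attribute names are hypothetical, and the conditional write is what prevents the same file from being loaded twice:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def mark_file_for_processing(s3_key):
    """Record an incoming S3 key; the conditional write ensures each
    file is registered (and therefore loaded) only once."""
    try:
        dynamodb.put_item(
            TableName="processed_files",  # hypothetical table
            Item={"s3_key": {"S": s3_key}, "status": {"S": "pending"}},
            ConditionExpression="attribute_not_exists(s3_key)",
        )
        return True   # new file, safe to load into Redshift
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already tracked, skip it
        raise
```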
You want to load data from a MySQL server installed on an EC2 t2.micro instance to be processed by AWS Glue. What applies best here?
The instance should be in your VPC (Although we didn't really discuss access controls, you could arrive at this answer through process of elimination; you'll find yourself doing that on the exam a lot. This isn't really a Glue-specific question; it's more about how to connect an AWS service such as Glue to EC2.)
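A hedged example of registering such a connection with boto3; every name, endpoint, subnet, and security group below is a placeholder:

```python
import boto3

glue = boto3.client("glue")

# The connection pins Glue to a subnet/security group inside your VPC,
# which is what lets it reach the MySQL instance running on EC2.
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-on-ec2",  # hypothetical
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://10.0.1.25:3306/sales",  # hypothetical endpoint
            "USERNAME": "glue_user",
            "PASSWORD": "********",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0abc1234",  # hypothetical
            "SecurityGroupIdList": ["sg-0def5678"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```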
What is the simplest way to make sure the metadata under Glue Data Catalog is always up-to-date and in-sync with the underlying data without your intervention each time?
Schedule crawlers to run periodically (Crawlers may be easily scheduled to run periodically while defining them.)
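For example, a crawler can be created with a cron schedule directly (the names and paths below are hypothetical):

```python
import boto3

glue = boto3.client("glue")

# The Schedule expression uses cron syntax; this runs the crawler every 6 hours.
glue.create_crawler(
    Name="s3-data-crawler",             # hypothetical
    Role="AWSGlueServiceRole-default",  # hypothetical role
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://my-landing-bucket/data/"}]},
    Schedule="cron(0 */6 * * ? *)",
)
```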
Which programming languages can be used to write ETL code for AWS Glue?
Python and Scala (Glue ETL runs on Apache Spark under the hood, and these happen to be the primary languages used for Spark development.)
Can you run existing ETL jobs with AWS Glue?
Yes (You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.)
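A minimal Glue PySpark job skeleton of the kind you would upload to S3; the database, table, and output path are placeholders:

```python
# Minimal Glue PySpark job skeleton (runs inside Glue, not locally).
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the Data Catalog, write out as Parquet (names are placeholders).
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="analytics", table_name="raw_events")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-landing-bucket/curated/"},
    format="parquet",
)
job.commit()
```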
How can you be notified of the execution of AWS Glue jobs?
Using CloudWatch + SNS (CloudWatch Events can match Glue job state changes, such as success or failure, and publish a notification to an SNS topic.)
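A sketch of routing Glue job state changes to an SNS topic with a CloudWatch Events rule; the rule name and topic ARN are hypothetical:

```python
import json
import boto3

events = boto3.client("events")

# Match Glue job state changes (e.g. failures) and route them to SNS.
events.put_rule(
    Name="glue-job-failed",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT"]},
    }),
)
events.put_targets(
    Rule="glue-job-failed",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}],  # hypothetical topic
)
```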
Of the following tools with Amazon EMR, which one is used for querying multiple data stores at once?
- Presto
- Hue
- Ganglia
- Ambari
Presto
Which one of the following statements is NOT TRUE regarding EMR Notebooks?
- EMR Notebook is stopped if it is idle for an extended time
- EMR Notebooks currently do not integrate with repositories for version control
- EMR Notebooks can be opened without logging into the AWS Management Console
- You cannot attach your notebook to a Kerberos enabled EMR cluster
EMR Notebooks can be opened without logging into the AWS Management Console
How can you get a history of all EMR API calls made on your account for security or compliance auditing?
Using AWS CloudTrail (CloudTrail captures all EMR API calls as events, which you can review for security or compliance auditing.)
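A hedged example of querying that history with boto3, filtering on the EMR event source:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Pull recent events whose source is the EMR API endpoint.
resp = cloudtrail.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "EventSource",
        "AttributeValue": "elasticmapreduce.amazonaws.com",
    }],
    MaxResults=50,
)
for event in resp["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "-"))
```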
When you delete your EMR cluster, what happens to the EBS volumes?
EMR will delete the volumes once the EMR cluster is terminated.
Which one of the following statements is NOT TRUE regarding Apache Pig?
- Pig supports interactive and batch cluster types
- Pig is operated by a SQL-like language called Pig Latin
- When used with Amazon EMR, Pig allows accessing multiple filesystems
- Pig supports access through JDBC
Pig supports access through JDBC
What limit, if any, is there to the size of your training dataset in Amazon Machine Learning by default?
100 GB (By default, Amazon ML is limited to 100GB of training data. You can file a support ticket to get this increased, but Amazon ML cannot handle terabyte-scale data.)
The audit team of an organization needs a history of Amazon SageMaker API calls made on their account for security analysis and operational troubleshooting purposes. Which of the following services helps in this regard?
CloudTrail (SageMaker outputs its results to both CloudTrail and CloudWatch, but CloudTrail is specifically designed for auditing purposes.)
Is there a limit to the size of the dataset that you can use for training models with Amazon SageMaker? If so, what is the limit?
No fixed limit (There are no fixed limits to the size of the dataset you can use for training models with Amazon SageMaker.)
Which of the following is a new Amazon SageMaker capability that enables machine learning models to train once and run anywhere in the cloud and at the edge?
- SageMaker Neo
- SageMaker Search
- Batch Transform
- Jupyter Notebooks
SageMaker Neo
A Python developer is planning to develop a machine learning model to predict real estate prices using a Jupyter notebook, and to train and deploy this model in a highly available and scalable manner. The developer wishes to avoid worrying about provisioning sufficient capacity for the model. Which of the following services is best suited for this?
Amazon SageMaker (SageMaker is the only scalable solution that is both fully managed and uses Jupyter notebooks.)
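A minimal sketch using the SageMaker Python SDK (v2-style API); the role ARN, image URI, and data paths are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Train a model from a container image; SageMaker provisions the instances.
estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-landing-bucket/housing/train/"})  # hypothetical data path

# Deploy behind a fully managed, scalable endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```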
Which open-source Web interface provides you with an easy way to run scripts, manage the Hive metastore, and view HDFS?
- Apache Zeppelin
- Ganglia
- YARN Resource Manager
- Hue
Hue
(Hue (Hadoop User Experience) is an open-source, web-based, graphical user interface for use with Amazon EMR and Apache Hadoop. Hue groups together several different Hadoop ecosystem projects into a configurable interface for your Amazon EMR cluster. Further information: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hue.html)
Which of the following are the 4 modules (libraries) of Spark? (Choose 4)
- Apache Mesos
- SparkSQL
- Spark Streaming
- GraphX
- MLlib
- YARN
- SparkSQL
- Spark Streaming
- GraphX
- MLlib
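A short PySpark illustration touching these modules; note that GraphX itself is Scala/Java-only, so this sketch is indicative rather than exhaustive:

```python
from pyspark.sql import SparkSession  # Spark SQL entry point

spark = SparkSession.builder.appName("modules-demo").getOrCreate()

# Spark SQL: query structured data.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT COUNT(*) AS n FROM t").show()

# MLlib: a simple feature transformer.
from pyspark.ml.feature import Tokenizer
tok = Tokenizer(inputCol="label", outputCol="words")

# Spark Streaming and GraphX also ship with Spark:
#   pyspark.streaming.StreamingContext (Streaming)
#   GraphX has no Python API; Python users typically reach for GraphFrames instead.
```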
Which of the following does Spark Streaming use to consume data from a Kinesis Stream?
- Kinesis Client Library
- Kinesis Consumer Library
- Kinesis Connector Library
- Kinesis Producer Library
Kinesis Client Library
(Spark Streaming uses the Kinesis Client Library (KCL) to consume data from a Kinesis stream. KCL handles complex tasks like load balancing, failure recovery, and check-pointing. Further information: https://spark.apache.org/docs/latest/streaming-kinesis-integration.html)
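A sketch using the Spark 2.x-era Python API from the spark-streaming-kinesis-asl package; the stream name and region are hypothetical:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-demo")
ssc = StreamingContext(sc, batchDuration=10)

# Under the hood this uses the KCL, which handles load balancing,
# failure recovery, and checkpointing of stream positions.
stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-demo",  # KCL application / DynamoDB table name
    streamName="clickstream",       # hypothetical stream
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
)
stream.pprint()
ssc.start()
ssc.awaitTermination()
```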
True or False: EBS volumes used with EMR persist after the cluster is terminated.
False
(When EBS volumes are used with EMR, the volumes do not persist after cluster termination. Compare this to how EBS behaves when used with an ordinary EC2 instance: there, it is possible for the volume to persist after its instance is terminated.)