Analytics | Amazon EMR Flashcards
What is Amazon EMR?
General
Amazon EMR | Analytics
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
What can I do with Amazon EMR?
General
Amazon EMR | Analytics
Using Amazon EMR, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon EMR lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows, and programmatically monitor progress of running clusters. In addition, you can use the simple web interface of the AWS Management Console to launch your clusters and monitor processing-intensive computation on clusters of Amazon EC2 instances.
Who can use Amazon EMR?
General
Amazon EMR | Analytics
Anyone who requires simple access to powerful data analysis can use Amazon EMR. You don’t need any software development experience to experiment with several sample applications available in the Developer Guide and on the AWS Big Data Blog.
What can I do with Amazon EMR that I could not do before?
General
Amazon EMR | Analytics
Amazon EMR significantly reduces the complexity of the time-consuming set-up, management, and tuning of Hadoop clusters or the compute capacity upon which they sit. You can instantly spin up large Hadoop clusters which will start processing within minutes, not hours or days. When your cluster finishes its processing, unless you specify otherwise, it will be automatically terminated so you are not paying for resources you no longer need.
Using this service you can quickly perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
As a software developer, you can also develop and run your own more sophisticated applications, allowing you to add functionality such as scheduling, workflows, monitoring, or other features.
What is the data processing engine behind Amazon EMR?
General
Amazon EMR | Analytics
Amazon EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named “MapReduce,” where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster. This framework has been widely used by developers, enterprises and startups and has proven to be a reliable software platform for processing up to petabytes of data on clusters of thousands of commodity machines.
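The MapReduce model described above can be illustrated with a small local sketch. It is not Hadoop code: the function names and the log-level counting task are assumptions chosen for illustration, and everything runs in a single process rather than across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_into_fragments(records: List[str], size: int) -> List[List[str]]:
    """Divide the input into small fragments that could each run on any node."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_fragment(fragment: Iterable[str]) -> List[Tuple[str, int]]:
    """Map phase: emit (first word, 1) for every non-empty record in a fragment."""
    return [(record.split()[0], 1) for record in fragment if record.strip()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between the phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_groups(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce phase: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_log_levels(records: List[str], fragment_size: int = 2) -> Dict[str, int]:
    """Run the whole pipeline locally on a list of log lines."""
    fragments = split_into_fragments(records, fragment_size)
    pairs = [pair for fragment in fragments for pair in map_fragment(fragment)]
    return reduce_groups(shuffle(pairs))
```

Because each fragment is mapped independently, the map calls could run on different machines; only the shuffle needs to see all emitted pairs.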
What is an Amazon EMR cluster?
General
Amazon EMR | Analytics
Amazon EMR historically referred to an Amazon EMR cluster (and all processing steps assigned to it) as a “job flow”. Every cluster or job flow has a unique identifier that starts with “j-”.
What is a cluster step?
General
Amazon EMR | Analytics
A cluster step is a user-defined unit of processing, mapping roughly to one algorithm that manipulates the data. A step is a Hadoop MapReduce application implemented as a Java jar or a streaming program written in Java, Ruby, Perl, Python, PHP, R, or C++. For example, to count the frequency with which words appear in a document, and output them sorted by the count, the first step would be a MapReduce application which counts the occurrences of each word, and the second step would be a MapReduce application which sorts the output from the first step based on the counts.
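The two-step word count example above can be sketched locally as two plain functions chained together. This is only an illustration of how one step's output feeds the next: the function names are hypothetical, and on a real cluster each step would be a separate MapReduce application reading and writing its data in storage.

```python
from collections import Counter
from typing import Dict, List, Tuple


def step_count_words(document: str) -> Dict[str, int]:
    """Step 1: count the occurrences of each word (case-insensitive)."""
    return dict(Counter(document.lower().split()))


def step_sort_by_count(counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Step 2: sort step 1's output by descending count, ties broken by word."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def run_steps(document: str) -> List[Tuple[str, int]]:
    """Run the steps in order, feeding each step's output to the next."""
    return step_sort_by_count(step_count_words(document))
```

Calling `run_steps("the cat and the hat and the bat")` places `("the", 3)` first and `("and", 2)` second, followed by the words that appear once.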
What are the different cluster states?
General
Amazon EMR | Analytics
STARTING – The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING – The cluster is in the process of shutting down.
TERMINATED – The cluster was shut down without error.
TERMINATED_WITH_ERRORS – The cluster was shut down with errors.
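When handling these states in a script, it can help to model them explicitly. The following is a minimal sketch using the state names listed above; the helper names `is_shut_down` and `shut_down_cleanly` are hypothetical and not part of any AWS SDK.

```python
from enum import Enum


class ClusterState(Enum):
    """The cluster states listed above, keyed by their reported names."""
    STARTING = "STARTING"
    BOOTSTRAPPING = "BOOTSTRAPPING"
    RUNNING = "RUNNING"
    WAITING = "WAITING"
    TERMINATING = "TERMINATING"
    TERMINATED = "TERMINATED"
    TERMINATED_WITH_ERRORS = "TERMINATED_WITH_ERRORS"


def is_shut_down(state: ClusterState) -> bool:
    """True once the cluster has finished shutting down, with or without errors."""
    return state in (ClusterState.TERMINATED, ClusterState.TERMINATED_WITH_ERRORS)


def shut_down_cleanly(state: ClusterState) -> bool:
    """True only when the cluster was shut down without error."""
    return state is ClusterState.TERMINATED
```

A state string returned by a status call can be converted with `ClusterState("WAITING")`, which raises `ValueError` for any name not in the list.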
What are the different step states?
Launching a Cluster
Amazon EMR | Analytics
PENDING – The step is waiting to be run.
RUNNING – The step is currently running.
COMPLETED – The step completed successfully.
CANCELLED – The step was cancelled before running because an earlier step failed or the cluster was terminated before it could run.
FAILED – The step failed while running.
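The relationship between these states can be sketched with a small local simulation. It assumes the behavior described above, where a failure cancels every later step; the function name `run_step_sequence` is hypothetical, and each step is modeled as a plain callable that raises an exception on failure.

```python
from typing import Callable, List


def run_step_sequence(steps: List[Callable[[], object]]) -> List[str]:
    """Run steps in order and return the final state of each one."""
    states = ["PENDING"] * len(steps)  # every step starts out waiting to run
    failed = False
    for index, step in enumerate(steps):
        if failed:
            # An earlier step failed, so this one is cancelled before running.
            states[index] = "CANCELLED"
            continue
        states[index] = "RUNNING"
        try:
            step()
            states[index] = "COMPLETED"
        except Exception:
            states[index] = "FAILED"
            failed = True
    return states
```

For example, a sequence whose second step raises an exception ends as `["COMPLETED", "FAILED", "CANCELLED"]`.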
How can I access Amazon EMR?
Launching a Cluster
Amazon EMR | Analytics
You can access Amazon EMR by using the AWS Management Console, Command Line Tools, SDKs, or the EMR API.
How can I launch a cluster?
Launching a Cluster
Amazon EMR | Analytics
You can launch a cluster through the AWS Management Console by filling out a simple cluster request form. In the request form, you specify the name of your cluster, the location in Amazon S3 of your input data, your processing application, your desired data output location, and the number and type of Amazon EC2 instances you’d like to use. Optionally, you can specify a location to store your cluster log files and an SSH key to log in to your cluster while it is running. Alternatively, you can launch a cluster using the RunJobFlow API or using the ‘create’ command in the Command Line Tools.
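As a rough sketch of the RunJobFlow route, the request form fields above map onto keyword arguments of the `run_job_flow` call in the AWS SDK for Python (boto3). The helper name, the step name, and the convention of passing the input and output locations as jar arguments are assumptions made for illustration. A real launch would typically also need a release label and IAM roles, which are omitted here.

```python
from typing import Any, Dict


def build_run_job_flow_request(
    name: str,
    jar_uri: str,
    input_uri: str,
    output_uri: str,
    instance_type: str,
    instance_count: int,
    log_uri: str = "",
    ssh_key_name: str = "",
) -> Dict[str, Any]:
    """Collect the request form fields into RunJobFlow keyword arguments."""
    instances: Dict[str, Any] = {
        "MasterInstanceType": instance_type,
        "SlaveInstanceType": instance_type,
        "InstanceCount": instance_count,
        # False lets the cluster terminate once its steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    }
    if ssh_key_name:  # optional SSH key for logging in to the cluster
        instances["Ec2KeyName"] = ssh_key_name

    request: Dict[str, Any] = {
        "Name": name,
        "Instances": instances,
        "Steps": [
            {
                "Name": "process-input",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                # Assumes the jar takes input and output locations as arguments.
                "HadoopJarStep": {"Jar": jar_uri, "Args": [input_uri, output_uri]},
            }
        ],
    }
    if log_uri:  # optional location for cluster log files
        request["LogUri"] = log_uri
    return request


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("emr").run_job_flow(**request)
#   cluster_id = response["JobFlowId"]
```

Building the request separately from sending it keeps the form-to-API mapping easy to inspect before any instances are started.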
How can I get started with Amazon EMR?
Launching a Cluster
Amazon EMR | Analytics
To sign up for Amazon EMR, click the “Sign Up Now” button on the Amazon EMR detail page http://aws.amazon.com/elasticmapreduce. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon EMR; if you are not already signed up for these services, you will be prompted to do so during the Amazon EMR sign-up process. After signing up, please refer to the Amazon EMR documentation, which includes our Getting Started Guide – the best place to get going with the service.
How can I terminate a cluster?
Launching a Cluster
Amazon EMR | Analytics
At any time, you can terminate a cluster via the AWS Management Console by selecting a cluster and clicking the “Terminate” button. Alternatively, you can use the TerminateJobFlows API. If you terminate a running cluster, any results that have not been persisted to Amazon S3 will be lost and all Amazon EC2 instances will be shut down.
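For the TerminateJobFlows route, a minimal sketch of preparing the request is shown below. The helper name is hypothetical, and the “j-” prefix check is an assumed safety measure based on the identifier format, not something the API requires callers to do.

```python
from typing import Dict, Iterable, List


def build_terminate_request(cluster_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Validate cluster identifiers and wrap them as TerminateJobFlows arguments."""
    ids = list(cluster_ids)
    malformed = [cluster_id for cluster_id in ids if not cluster_id.startswith("j-")]
    if malformed:
        raise ValueError(f"not cluster identifiers: {malformed}")
    return {"JobFlowIds": ids}


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").terminate_job_flows(**build_terminate_request(["j-..."]))
# Results not yet persisted to Amazon S3 are lost once the cluster terminates.
```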
Does Amazon EMR support multiple simultaneous clusters?
Launching a Cluster
Amazon EMR | Analytics
Yes. At any time, you can create a new cluster, even if you’re already running one or more clusters.
How many clusters can I run simultaneously?
Developing
Amazon EMR | Analytics
You can start as many clusters as you like, but you are limited to 20 instances across all your clusters. If you need more instances, complete the Amazon EC2 instance request form and your use case and requested increase will be considered. If your Amazon EC2 limit has already been raised, the new limit will be applied to your Amazon EMR clusters.
Where can I find code samples?
Developing
Amazon EMR | Analytics
Check out the sample code in these Articles and Tutorials.
How do I develop a data processing application?
Developing
Amazon EMR | Analytics
You can develop a data processing job on your desktop, for example, using Eclipse or NetBeans plug-ins such as IBM MapReduce Tools for Eclipse (http://www.alphaworks.ibm.com/tech/mapreducetools). These tools make it easy to develop and debug MapReduce jobs and test them locally on your machine. Additionally, you can develop your application directly on Amazon EMR using one or more instances.
What is the benefit of using the Command Line Tools or APIs vs. AWS Management Console?
Developing
Amazon EMR | Analytics
The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running clusters, to create additional custom functionality around clusters (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon EMR customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your clusters directly from a web browser.
Can I add steps to a cluster that is already running?
Developing
Amazon EMR | Analytics
Yes. Once the cluster is running, you can optionally add more steps to it via the AddJobFlowSteps API. The AddJobFlowSteps API will add new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your cluster or for debugging.
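The conditional logic mentioned above could look roughly like the following sketch, which picks the next step based on the state of the previous one and shapes it for the `add_job_flow_steps` call in boto3. The helper name, step names, and S3 jar locations are hypothetical placeholders.

```python
from typing import Any, Dict


def build_add_steps_request(cluster_id: str, previous_step_state: str) -> Dict[str, Any]:
    """Choose a follow-up step from the previous step's final state."""
    if previous_step_state == "COMPLETED":
        # The previous step succeeded, so continue with the next processing stage.
        step = {
            "Name": "sort-output",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "s3://example-bucket/jars/sort.jar", "Args": []},
        }
    else:
        # Otherwise queue a diagnostic step to help with debugging.
        step = {
            "Name": "collect-diagnostics",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": "s3://example-bucket/jars/diagnose.jar", "Args": []},
        }
    return {"JobFlowId": cluster_id, "Steps": [step]}


# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("emr").add_job_flow_steps(**build_add_steps_request("j-...", "COMPLETED"))
```

The new step is appended to the end of the cluster's current step sequence, so it runs only after any steps already queued.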
Can I run a persistent cluster?
Developing
Amazon EMR | Analytics
Yes. Amazon EMR clusters that are started with the --alive flag will continue until explicitly terminated. This allows customers to add steps to a cluster on demand. You may want to use this to debug your application without having to repeatedly wait for cluster startup. You may also use a persistent cluster as a long-running data warehouse. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.
Can I be notified when my cluster is finished?
Developing
Amazon EMR | Analytics
You can sign up for Amazon SNS and have the cluster post to your SNS topic when it is finished. You can also view your cluster progress on the AWS Management Console, or you can use the Command Line, SDK, or APIs to get the status of the cluster.
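Checking status through the SDK usually comes down to a polling loop. The sketch below assumes a caller-supplied `get_state` function that returns the cluster's current state name; the function name `wait_for_cluster`, the polling interval, and the choice to treat WAITING as finished are all assumptions made for illustration.

```python
import time
from typing import Callable

# States in which no further work will happen without user action.
FINISHED_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS", "WAITING"}


def wait_for_cluster(
    get_state: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll until the cluster is shut down or idle, then return its state."""
    for _ in range(max_polls):
        state = get_state()
        if state in FINISHED_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"cluster still busy after {max_polls} polls")


# With boto3, get_state could be wired up roughly as:
#   import boto3
#   emr = boto3.client("emr")
#   get_state = lambda: emr.describe_cluster(ClusterId="j-...")["Cluster"]["Status"]["State"]
```

Passing the state lookup in as a function keeps the loop testable without an AWS account.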
What programming languages does Amazon EMR support?
Developing
Amazon EMR | Analytics
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer’s Guide for instructions on using Hadoop Streaming.
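With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The following is a minimal word count sketch in Python written as generator functions so it can be tried locally; the function names are hypothetical, and the local `sorted` call stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1\n"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key.

    The input must already be sorted by key.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}\n"


# As standalone streaming scripts, each function would be connected to the
# standard streams, for example:
#   import sys
#   sys.stdout.writelines(map_lines(sys.stdin))
```

Locally, `list(reduce_lines(sorted(map_lines(["a b a\n", "b a\n"]))))` yields one line per distinct word with its total count.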
What OS versions are supported with Amazon EMR?
Developing
Amazon EMR | Analytics
At this time, Amazon EMR supports Debian/Squeeze in 32-bit and 64-bit modes.
Can I view the Hadoop UI while my cluster is running?
Developing
Amazon EMR | Analytics
Yes. Please refer to the Hadoop UI section in the Developer’s Guide for instructions on how to access the Hadoop UI.