Amazon EMR | Using Hive Flashcards by Parri Pandian

What happens when I remove an attached volume from a running cluster?

Using Hive

Amazon EMR | Analytics

Amazon EMR treats the removal of an attached volume from a running cluster as a node failure. It replaces the node and the EBS volume with a new node and a new volume of the same type.

What is Apache Hive?

Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Hive is operated through a SQL-based language called Hive QL that allows users to structure, summarize, and query data sources stored in Amazon S3. Hive QL goes beyond standard SQL, adding first-class support for map/reduce functions and complex, extensible user-defined data types such as JSON and Thrift. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Hive can be extended with user-defined functions written in Java and deployed by storing them in Amazon S3.

What can I do with Hive running on Amazon EMR?

Using Hive with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and the easy-to-use tools available with Amazon EMR. With Amazon EMR, you can turn your Hive applications into a reliable data warehouse for tasks such as data analytics, monitoring, and business intelligence.

How is Hive different from traditional RDBMS systems?

Traditional RDBMS systems provide transaction semantics and ACID properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly. They provide for fast updates of small amounts of data and for enforcement of referential integrity constraints. Typically they run on a single large machine and do not support executing map and reduce functions on a table, nor do they typically support operating on complex user-defined data types.

In contrast, Hive executes SQL-like queries using MapReduce. Consequently, it is optimized for doing full table scans while running on a cluster of machines and is therefore able to process very large amounts of data. Hive provides partitioned tables, which allow it to scan a partition of a table rather than the whole table if that is appropriate for the query it is executing.
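The effect of partitioning can be pictured with a small Python sketch. It does not show how Hive is implemented; the table layout, the `dt` partition key, the file names, and both function names are hypothetical and only model the idea that a query filtered on the partition key reads one partition's files rather than every file in the table.

```python
from typing import Dict, List

# Hypothetical table layout: partition value -> data files in that partition.
# The names are made up for illustration only.
PARTITIONS: Dict[str, List[str]] = {
    "dt=2013-01-01": ["part-00000", "part-00001"],
    "dt=2013-01-02": ["part-00000"],
    "dt=2013-01-03": ["part-00000", "part-00001", "part-00002"],
}


def files_for_full_scan(partitions: Dict[str, List[str]]) -> List[str]:
    """Every file in the table: what a query with no partition filter reads."""
    return [
        f"{part}/{name}"
        for part, names in sorted(partitions.items())
        for name in names
    ]


def files_for_partition(partitions: Dict[str, List[str]], dt: str) -> List[str]:
    """Only the files of one partition: what a query filtered on dt reads."""
    key = f"dt={dt}"
    return [f"{key}/{name}" for name in partitions.get(key, [])]


full = files_for_full_scan(PARTITIONS)
pruned = files_for_partition(PARTITIONS, "2013-01-02")
print(len(full), "files read by a full table scan")
print(len(pruned), "file(s) read when the query filters on dt")
```

The fewer partitions a query touches, the less data the cluster has to read, which is why choosing a partition key that matches common filters matters.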

Traditional RDBMS systems are the better choice when transactional semantics and referential integrity are required and frequent small updates are performed. Hive is best for offline reporting, transformation, and analysis of large data sets; for example, performing click stream analysis of a large website or collection of websites.

A common practice is to export data from RDBMS systems into Amazon S3, where offline analysis can be performed using Amazon EMR clusters running Hive.

How can I get started with Hive running on Amazon EMR?

The best place to start is the written Amazon EMR documentation.

Are there new features in Hive specific to Amazon EMR?

Yes. Four new features make Hive even more powerful when used with Amazon EMR:

a/ The ability to load table partitions automatically from Amazon S3. Previously, to import a partitioned table you needed a separate alter table statement for each individual partition in the table. Amazon EMR now includes a new statement type for the Hive language: "alter table recover partitions." This statement allows you to easily import tables concurrently into many clusters without having to maintain a shared metadata store. Use this functionality to read from tables into which external processes are depositing data, for example log files.

b/ The ability to specify an off-instance metadata store. By default, the metadata store where Hive stores its schema information is located on the master node and ceases to exist when the cluster terminates. This feature allows you to override the location of the metadata store and use, for example, a MySQL instance that you already have running in EC2.

c/ Writing data directly to Amazon S3. When writing data to tables in Amazon S3, the version of Hive installed in Amazon EMR writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement, but it means that HDFS and S3 behave differently from a Hive perspective. You cannot read from and write to the same table within the same statement if that table is located in Amazon S3. If you want to update a table located in S3, create a temporary table in the cluster's local HDFS filesystem, write the results to that table, and then copy them to Amazon S3.

d/ Accessing resources located in Amazon S3. The version of Hive installed in Amazon EMR allows you to reference resources such as scripts for custom map and reduce operations or additional libraries located in Amazon S3 directly from within your Hive script (e.g., add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar).
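As a rough illustration of features a/ and d/, the Python sketch below renders a short Hive script as text. Only the `add jar` path and the `alter table ... recover partitions` statement come from the description above; the table name, its columns, the `dt` partition key, the tab-delimited format, and the `example-bucket` location are hypothetical, and the jar is registered only to show the syntax.

```python
def recover_partitions_script(table: str, s3_location: str, jar: str) -> str:
    """Render a Hive script that registers a jar stored in Amazon S3,
    declares a partitioned external table over S3 data, and then asks
    Hive on Amazon EMR to discover the partitions under that location."""
    statements = [
        # d/ reference a resource in S3 directly from the script
        f"add jar {jar}",
        # hypothetical table definition; columns and format are placeholders
        (
            f"create external table if not exists {table} (\n"
            "  request_time string,\n"
            "  ad_id string\n"
            ")\n"
            "partitioned by (dt string)\n"
            "row format delimited fields terminated by '\\t'\n"
            f"location '{s3_location}'"
        ),
        # a/ one statement instead of one alter table per partition
        f"alter table {table} recover partitions",
    ]
    return ";\n".join(statements) + ";\n"


script = recover_partitions_script(
    table="impressions",
    s3_location="s3://example-bucket/tables/impressions/",
    jar="s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar",
)
print(script)
```

Because the table is external and its partitions are recovered from the S3 layout, each cluster can run the same script without sharing a metadata store.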

What types of Hive clusters are supported?

There are two types of clusters supported with Hive: interactive and batch. In interactive mode, you start a cluster and run Hive scripts interactively, directly on the master node. Typically, this mode is used for ad hoc data analysis and for application development. In batch mode, the Hive script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.
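The difference between the two modes can be sketched as the shape of a launch request. The Python below only builds a dictionary modeled on the keyword arguments of boto3's EMR `run_job_flow` call; it does not contact AWS. The function name, the cluster names, and the script path are hypothetical, the `command-runner.jar` step arguments assume a recent EMR release, and a real request needs further fields (release label, instance types, IAM roles) that are omitted here.

```python
from typing import Any, Dict, Optional


def cluster_request(name: str, hive_script_s3: Optional[str] = None) -> Dict[str, Any]:
    """Build an illustrative, incomplete launch request.

    With hive_script_s3 set, the cluster runs that script as a step and
    terminates when no steps remain (batch). Without it, the cluster stays
    alive with no steps so Hive can be used interactively on the master node.
    """
    batch = hive_script_s3 is not None
    request: Dict[str, Any] = {
        "Name": name,
        "Applications": [{"Name": "Hive"}],
        "Instances": {"KeepJobFlowAliveWhenNoSteps": not batch},
        "Steps": [],
    }
    if batch:
        request["Steps"].append({
            "Name": "Run Hive script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                # Assumed step form for recent EMR releases
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", hive_script_s3],
            },
        })
    return request


interactive = cluster_request("adhoc-analysis")
batch = cluster_request("nightly-report", "s3://example-bucket/scripts/report.q")
print(interactive["Instances"], len(interactive["Steps"]))
print(batch["Instances"], len(batch["Steps"]))
```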

How can I launch a Hive cluster?

Both batch and interactive clusters can be started from the AWS Management Console, the EMR command line client, or the APIs. Refer to the Hive section in the Release Guide for more details on launching a Hive cluster.

When should I use Hive vs. Pig?

Hive and Pig both provide high-level data-processing languages with support for complex data types for operating on large datasets. The Hive language is a variant of SQL and so is more accessible to people already familiar with SQL and relational databases. Hive has support for partitioned tables, which allow Amazon EMR clusters to pull down only the table partition relevant to the query being executed rather than doing a full table scan. Both Pig and Hive have query plan optimization. Pig is able to optimize across an entire script, while Hive queries are optimized at the statement level.

Ultimately the choice of whether to use Hive or PIG will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.

What version of Hive does Amazon EMR support?

Amazon EMR supports multiple versions of Hive, including version 0.11.0.

Can I write to a table from two clusters concurrently?

No. Hive does not support concurrent writes to a table. Avoid writing to the same table from two places at once, and avoid reading from a table while you are writing to it. Hive behaves non-deterministically when a table is read and written at the same time, or written by two writers at the same time.

Can I share data between clusters?

Yes. You can read data in Amazon S3 within a Hive script by having ‘create external table’ statements at the top of your script. You need a create table statement for each external resource that you access.
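A minimal sketch of that layout, written as Python that renders the script text: one `create external table` statement per shared resource, followed by the query that uses them. The table names, columns, `example-bucket` locations, and the query are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical shared datasets: table name -> (column list, S3 location)
SHARED_TABLES: Dict[str, Tuple[str, str]] = {
    "page_views": ("user_id string, url string",
                   "s3://example-bucket/shared/page_views/"),
    "users": ("user_id string, country string",
              "s3://example-bucket/shared/users/"),
}


def script_with_external_tables(tables: Dict[str, Tuple[str, str]], query: str) -> str:
    """Put one create external table statement per resource at the top of
    the script, then append the query that reads from those tables."""
    header = [
        f"create external table if not exists {name} ({columns})\n"
        f"location '{location}';"
        for name, (columns, location) in tables.items()
    ]
    return "\n".join(header + [query.rstrip(";") + ";"]) + "\n"


script = script_with_external_tables(
    SHARED_TABLES,
    "select u.country, count(*) from page_views p "
    "join users u on (p.user_id = u.user_id) group by u.country",
)
print(script)
```

Any cluster that runs the same header sees the same data, because the tables only point at files in Amazon S3 and do not copy them.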

Should I run one large cluster shared among many users, or many smaller clusters?

Amazon EMR lets you use either approach. On the one hand, one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad hoc querying or have workloads that vary with time, you may choose to create several separate clusters, each tuned to a specific task, that share data sources stored in Amazon S3.

Can I access a script or jar resource which is on my local file system?

No. You must upload the script or jar to Amazon S3 or to the cluster’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.

Can I run a persistent cluster executing multiple Hive queries?

Yes. Run the cluster in manual termination mode so that it does not terminate between Hive steps. To reduce the risk of data loss, we recommend periodically persisting all of your important data in Amazon S3. It is good practice to regularly transfer your work to a new cluster to test your process for recovering from master node failure.

Can multiple users execute Hive steps on the same source data?

Yes. Hive scripts executed by multiple users on separate clusters may contain create external table statements to concurrently import source data residing in Amazon S3.

Can multiple users run queries on the same cluster?

Yes. In batch mode, multiple users can add Hive steps to the same cluster, but the steps are executed serially. In interactive mode, several users can be logged on to the same cluster and execute Hive statements concurrently.

Can data be shared between multiple AWS users?

Yes. Data can be shared using the standard Amazon S3 sharing mechanism described at http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3_ACLs.html

Does Hive support access from JDBC?

Yes. Hive provides a JDBC driver, which can be used to programmatically execute Hive statements. To start a JDBC service in your cluster, you need to pass an optional parameter in the Amazon EMR command line client. You also need to establish an SSH tunnel because the security group does not permit external connections.
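The tunnel-then-connect sequence can be sketched in Python as two strings: the SSH command that forwards a local port to the master node, and the JDBC URL a client would then use against that local port. Everything specific here is an assumption: port 10000, the `hive2` URL scheme, the key file name, and the master node's DNS name are placeholders, and the right port and scheme depend on the Hive version running on your cluster.

```python
import shlex


def tunnel_command(master_dns: str, key_file: str,
                   local_port: int = 10000, remote_port: int = 10000) -> str:
    """SSH command that forwards local_port to the Hive JDBC service port
    on the master node (-N: run no remote command, only forward)."""
    return shlex.join([
        "ssh", "-i", key_file, "-N",
        "-L", f"{local_port}:localhost:{remote_port}",
        f"hadoop@{master_dns}",
    ])


def jdbc_url(local_port: int = 10000, database: str = "default",
             scheme: str = "hive2") -> str:
    """JDBC URL pointing at the local end of the tunnel, not at the cluster."""
    return f"jdbc:{scheme}://localhost:{local_port}/{database}"


command = tunnel_command("ec2-203-0-113-10.compute-1.amazonaws.com",
                         "example-key.pem")
url = jdbc_url()
print(command)
print(url)
```

The client connects to `localhost` because the tunnel carries the traffic to the master node, which is why no change to the security group is needed.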

What is your procedure for updating packages on EMR AMIs?

We run a select set of packages from Debian/stable including security patches. We will upgrade a package whenever it gets upgraded in Debian/stable. The “r-recommended” package on our image is up to date with Debian/stable (http://packages.debian.org/search?keywords=r-recommended).

Can I update my own packages on EMR clusters?

Yes. You can use Bootstrap Actions to install updates to packages on your clusters.
