Elastic Map Reduce Flashcards

(51 cards)

1
Q

_ _ describes the realization of _ _ _ by _, _, and _ data that was previously ignored or siloed due to the limitations of _ _ management technologies

A

Big data
greater business intelligence
storing
processing
analyzing
traditional data

Big data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies

2
Q

The V’s of Big Data
_ is the _ data travels
_ is the _ data requires
_ is the _ types of _

A

Velocity, speed
Volume, space
Variety, heterogeneous, files

-Velocity is the speed data travels
-Volume is the space data requires
-Variety is the heterogeneous types of files

3
Q

Velocity
_ _ from many sources at a _ _ of _
3 examples

A

Velocity
Ingesting data from many sources at a high rate of speed
-Internet of Things (IoT)
-clickstream data
-environmental data

4
Q

Volume
_ (one character)
_ (1000 bytes)
_ (1000^2 bytes)
_ (1000^3 bytes)
_ (1000^4 bytes)
_ (1000^5 bytes)
_ (1000^6 bytes)
_ (1000^7 bytes)

fun fact: Single oil well generates _ _ data per day.

A

Byte (one character)
Kilobyte (1000 bytes)
Megabyte (1000^2 bytes)
Gigabyte (1000^3 bytes)
Terabyte (1000^4 bytes)
Petabyte (1000^5 bytes)
Exabyte (1000^6 bytes)
Zettabyte (1000^7 bytes)

Single oil well generates 15 terabytes of data per day

BKMGTPEZ

5
Q

Volume Examples

A

-A standard work year - 2,016 hours
-YouTube (Google) Content ID System
-Looks for copyright violations in uploaded videos
-YouTube's Content ID system processes 250 years of video content in 24 hours

6
Q

Variety Examples

A

RDBMS (relational data) files
XML files
log files
unstructured text files
HTML files
PDF files
Video files

7
Q

Big Data
Is Big Data just a _ in _?
Is big data just a _ _ for technologies that always existed, but were just called something else?
Completely different _ for _ and _ _

A

fad, technology
new name
architecture, computing, data storage

-Is big data just a fad in technology?
-Is big data just a new name for technologies that always existed but were just called something else?
-Completely different architecture for computing and data storage

8
Q

Traditional computing model
Data stored in a _ _ like a _
Data copied to _ at _ _
_ _ bottlenecks on the _ _

A

-Data stored in a central location like a SAN
-Data copied to processors at run time
-Large volumes bottleneck on the transfer rate

9
Q

Hadoop Computing Model
Bring the _ _ _ _
_ and _ data when the _ _ _
Run the _ where the _ _

A

program to the data
replicate, distribute, data is stored
program, data resides

-Bring the program to the data
-Replicate and distribute data when the data is stored
-Run the program where the data resides

10
Q

Distributions
_ is a _ of _ _ _ _ applications that have been tested to _ _
Prominent providers of distributions include…

A

Distribution, collection of open source Apache applications, work together

Cloudera
Hortonworks
Amazon
Google
MS Azure

A distribution is a collection of open source Apache applications that have been tested to work together

Prominent providers of distributions include
-Cloudera
-MS Azure
-Hortonworks
-Google
-Amazon

11
Q

Hadoop
The _ _ software library is a _ that allows for the _ _ of large data sets across _ _ _ using simple _ _

A

Apache Hadoop, framework, distributed processing, clusters of computers, programming models

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

12
Q

Hadoop Characteristics
_ data storage
inexpensive _
combines up to _ _ _ for _ performance

A

inexpensive
servers
1000, distributed servers, massive

13
Q

Trends- Storage
Is a _
only getting _ and _ _
normalization vs _
Data schema on-_ vs _ on-write
data _
solid-_

A

commodity
cheaper, more abundant
denormalization
on-read, schema
lakes
state

is a commodity
only getting cheaper and more abundant
normalization vs denormalization
data schema on-read vs schema on-write
data lakes
solid state

14
Q

Trends-memory
Is a _
only getting _ and _ _
the _ the _
In-memory _ _ from _ _ of _
_ _ needs, depending on the size of _, lots of _

A

is a commodity
only getting cheaper and more abundant
the more the merrier
In-memory computing benefiting from massive allocations of RAM
The Hadoop NameNode needs, depending on the size of the cluster, lots of RAM

15
Q

Distributed Processing
Much cheaper to store _ _ of data using _ _ architecture

Think of _ _ on servers. At a large corporation there are massive quantities of _ _ _. They are used for analysis of _ _, _ _, _ _, _ _ and tuning, and more

Analyzing all of that _ stored data requires _ _ for analysis

A

massive quantities, big data

log files, log files (petabytes), security breaches, clickstream analysis, website statistics, infrastructure analysis, and more

cheaply, different application

-Much cheaper to store massive quantities of data using big data architecture
-Think of log files on servers. At a large corporation there are massive quantities of log files (petabytes). They are used for analysis of security breaches, clickstream analysis, website statistics, infrastructure analysis and tuning, and more.
-Analyzing all of that cheaply stored data requires a different application for analysis

16
Q

Hadoop Distributed File System
_ is the data storage layer for a _ _
Inexpensive reliable store for _ _ _ _
uses low cost industry _ _
data is _ and _ to multiple _ of _

A

HDFS, Hadoop system
massive amounts of data
standard hardware
replicated, distributed, nodes, storage

HDFS is the data storage layer for a Hadoop system
Inexpensive reliable storage for massive amounts of data
Uses low cost industry standard hardware
Data is replicated and distributed to multiple nodes of storage

17
Q

Hadoop application

HDFS is the _ _ _
-distributes _ _ across the cluster in a redundant manner
-Data is lost _ _

YARN is _ _ _ _
-Manages cluster resources for the _ _ _

MapReduce
-Base code that handles all _ _
-Maps data to _/_ _

A

Hadoop file system
data blocks
cluster termination
Yet another resource negotiator
collections of applications
data processing
key/value pairs

HDFS is the Hadoop file system
Distributes data blocks across the cluster in a redundant manner
Data is lost on cluster termination
YARN is Yet Another Resource Negotiator
Manages cluster resources for the collection of applications
MapReduce is the base code that handles all data processing
Maps data to key/value pairs

18
Q

Map Reduce
Mechanism for bringing the processing to _ _ _
Maps where data is stored on each _ _
contains a master job tracker managing _ _
uses the task tracker to execute tasks on each _ _

A

the stored data
HDFS node
task resources
HDFS node

Mechanism for bringing the processing to the stored data
Maps where data is stored on each HDFS node
contains a master job tracker managing task resources
uses the task tracker to execute tasks on each HDFS node (see the sketch below)
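
For illustration only: a minimal plain-Python sketch of the map and reduce idea (not actual Hadoop or EMR code; the function names and sample lines are made up). It shows input being mapped to key/value pairs, grouped by key in a shuffle step, and then reduced.

from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) key/value pairs for every word in a line of input
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all values emitted for one key
    return (word, sum(counts))

lines = ["big data on EMR", "big clusters process big data"]

grouped = defaultdict(list)          # shuffle/sort: group values by key
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))               # e.g. [('big', 3), ('clusters', 1), ...]

In real Hadoop, the map and reduce tasks run on the HDFS nodes where the data blocks live, which is the point of the card above.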

19
Q

What is EMR?
EMR stands for _ _ _
EMR is a managed hadoop service by _
AWS Distributions provide support for the most popular _ _ applications like _, _, _, _, and _

A

Elastic Map Reduce
AWS
open source
Spark, Hive, HDFS, Presto, and Flink

20
Q

EMR Cluster Architecture
Master Node “leader node”
-manages _ _
-tracks status of _
-Monitors _ _
-Single _ _

Core Node
-Saves _ _
-Used in _ _ _
-Runs _
-can be scaled _ or _

Task Node
-runs _ _
-does not store _
-_ instances can be used

A

Master Node “leader node”
-manages the cluster
-tracks status of tasks
-monitors cluster health
-Single EC2 instance

Core Node
-Saves HDFS data
-Used in multi-node clusters
-runs tasks
-can be scaled up or down

Task node
-runs tasks only
-does not store data
-spot instances can be used
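
A hedged boto3 sketch of launching a cluster with the three node roles above. The region, release label, instance types, counts, and IAM role names are illustrative assumptions, not values from the cards.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="flashcard-demo-cluster",
    ReleaseLabel="emr-6.15.0",          # assumed EMR release
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            # Master "leader" node: a single EC2 instance managing the cluster
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Core nodes: store HDFS data and run tasks
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes: run tasks only, no HDFS, so Spot capacity is safe
            {"Name": "Task", "InstanceRole": "TASK",
             "InstanceType": "m5.xlarge", "InstanceCount": 2,
             "Market": "SPOT"},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # assumed default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])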

21
Q

Transient versus Long clusters
_ clusters terminate once all steps are complete
- it _ _ _
-perform work and then shut down _ _

_ are manually terminated
-Functions as a data warehouse with periodic processing on _ _ _
-Task nodes can be scaled using _ _
-Set up with termination protection on and _ _ off

A

Transient clusters terminate once all steps are complete
-loading data, processing, and storing data
-perform work and then shut down, saving costs
Long-running clusters are manually terminated
-functions as a data warehouse with periodic processing on large data sets
-task nodes can be scaled using spot instances
-set up with termination protection on and auto-termination off
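
An illustrative boto3 sketch of the two termination settings mentioned above; the cluster id is a placeholder.

import boto3

emr = boto3.client("emr")

# Keep a long-running (data warehouse style) cluster from being terminated
# accidentally.
emr.set_termination_protection(
    JobFlowIds=["j-XXXXXXXXXXXXX"],
    TerminationProtected=True,
)

# For a transient-style cluster, an idle auto-termination policy shuts it
# down once the work is done (here, after one hour of idle time, in seconds).
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    AutoTerminationPolicy={"IdleTimeout": 3600},
)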

22
Q

Using EMR
_ and _ are part of cluster creation
- users connect directly to the _ _ to run jobs
-configure steps in a _
-submit ordered steps via the _

A

frameworks and applications
master node
cluster
console

Frameworks and applications are part of cluster creation
Users connect directly to the master node to run jobs
Configure steps in a cluster
Submit ordered steps via the console
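
Steps can also be submitted programmatically instead of via the console; a hedged boto3 sketch, where the cluster id, bucket, and script path are placeholders.

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "spark-step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets a step run a command on the master node
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/etl_job.py"],
            },
        }
    ],
)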

23
Q

EMR and AWS integration
_ provides the EMR nodes
_ provides a virtual network for nodes
_ stores input and output data
_ _ to schedule and start clusters
_ to configure permissions

A

EC2
VPC
S3
Data Pipeline
IAM

24
Q

EMR capabilities
_ is a by-the-hour service charge
_ is a separate set of charges
automatically provisions core nodes when they _
cluster core nodes can be resized _ _ _
core nodes can be removed but risk _ _
task nodes can be added on the _

A

EMR is a by-the-hour service charge
EC2 is a separate set of charges
automatically provisions core nodes when they fail
cluster core nodes can be resized on the fly
core nodes can be removed but risk data loss
task nodes can be added on the fly
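
An illustrative boto3 sketch of resizing nodes "on the fly"; the cluster id and counts are placeholders.

import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"

# Look up the existing instance groups, then grow the task group.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Growing the task group is safe; shrinking CORE is also possible but risks
# data loss, since core nodes hold HDFS blocks.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": task_group["Id"],
                     "InstanceCount": task_group["RequestedInstanceCount"] + 2}],
)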

25
Q

HIVE
-It is a tool that provides SQL querying of data stored in _ or HBase
-Accessed using the _ _
-Allows for easy _ _ _
-Transforms log file data into structures like _
-Consists of a schema in the Metastore and data in _

A

-It is a tool that provides SQL querying of data stored in HDFS or HBase
-Accessed using the HiveQL language
-Allows for easy ad-hoc queries
-Transforms log file data into structures like tables
-Consists of a schema in the Metastore and data in HDFS

26
Q

Success of HIVE
-Uses familiar SQL syntax for _ _
-Interactive and scalable on a _ _ _
-Works very well for _ _ _
-_/_ driver

A

OLAP queries
big data cluster
data warehouse applications
JDBC/ODBC

-Uses familiar SQL syntax for OLAP queries
-Interactive and scalable on a big data cluster
-Works very well for data warehouse applications
-JDBC/ODBC driver

27
Q

Hive Metastore and Glue
-The _ _ _ shares schemas across EMR and other AWS services
-_ is used to create data lakes

A

Glue Data Catalog

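One way Hive on EMR is pointed at the Glue Data Catalog is the hive-site configuration classification that AWS documents; a sketch of that setting as a Python dict that could be passed as the Configurations parameter of boto3's run_job_flow (the surrounding cluster definition is omitted, and the variable name is made up).

# Points Hive's metastore at the Glue Data Catalog so the schema is shared
# across EMR clusters and other AWS services.
glue_metastore_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore."
                "AWSGlueDataCatalogHiveClientFactory"
        },
    }
]
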
28
Q

Schema on read
-Verifies data organization when a query is _
-Provides much faster loading as structure is not _
-Multiple schemas serving different needs for the _ _
-Better option when the schema is not known at _ _

A

-Verifies data organization when a query is issued
-Provides much faster loading as structure is not validated
-Multiple schemas serving different needs for the same data
-Better option when the schema is not known at loading time

29
Q

HIVE query example
-Creates a new table (file) for user_active
-Selects all users
-From a table called user
-that have an active indicator

A

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 'A';

30
Q

Loading Data into Hive
-Create a Hive table called records to store data
-Identify the metadata type for each field
-Define the structure as tab delimited
------------------------------------------------
-Load data into the table from a local file
-OVERWRITE will replace the existing file

A

CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
----------------------------
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt'
OVERWRITE INTO TABLE records;

31
Q

Query Against a Hive Table
-SELECT using familiar SQL commands and specification
-MAX defines the maximum temperature for each year
-FROM defines our records table
-WHERE ensures clean data selections
-GROUP BY groups the values by year

A

hive> SELECT year, MAX(temperature)
    > FROM records
    > WHERE temperature != 1999 AND quality IN (0, 1, 4, 5, 9)
    > GROUP BY year;

32
Q

File Storage Formats
-_ _ _ are available for storage
-5 examples

A

Multiple file formats are available for storage
--as-format is the nomenclature
--as-textfile
--as-sequencefile
--as-parquetfile
--as-avrodatafile
-------------------------------
CREATE TABLE tablename (colname DATATYPE, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY char
STORED AS format;

33
Q

Binary Column Formats
-Column-oriented formats work best when only a few columns are used in _/_
-Hive provides native support for _
-STORED AS (PARQUET), etc.

A

-queries/calculations
-Parquet
--------------------------------------
CREATE TABLE users_parquet STORED AS PARQUET
AS
SELECT * FROM users;

34
Q

S3DistCp Copy
-Transaction for copying large amounts of data from _ to _
-Copies in a distributed manner using _
-Provides parallel path copying across _

A

-Transaction for copying large amounts of data from S3 to HDFS
-Copies in a distributed manner using MapReduce
-Provides parallel path copying across buckets
-------------------------------------------
s3-dist-cp --src=s3://jb101000/data --dest=hdfs:///data

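The same copy can also be submitted as an EMR step so it runs as a distributed MapReduce job on the cluster; a hedged boto3 sketch, with the cluster id as a placeholder.

import boto3

boto3.client("emr").add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "s3-to-hdfs-copy",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs the s3-dist-cp command on the cluster
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src=s3://jb101000/data",
                     "--dest=hdfs:///data"],
        },
    }],
)
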
35
Q

EMR Service
-EMR is the encompassing big data service in _
-Which applications are in a distribution for the EMR service?

A

-EMR is the encompassing big data service in AWS
-Hadoop, Spark, and Hive applications are in the distribution

36
Q

Clusters
-A cluster is a computing network of _
-_ _ in the cluster store Hadoop files

A

-A cluster is a computing network of computers
-Core nodes in the cluster store Hadoop files

37
Q

Select Application
-The _ _ is the distribution of applications tested together
-Apache applications like _ and _ are selected

A

-The EMR release is the distribution of applications tested together
-Apache applications like Spark and Hive are selected

38
Q

Configure Cluster Nodes
-_ configuration is for the master node
-_ configuration is for storage and processing
-_ configuration is for processing

A

-Primary node configuration is for the master node
-Core node configuration is for storage and processing
-Task node configuration is for processing

39
Q

Cluster Configuration
Master node
-_ is the standard configuration
Core/Task nodes
-_ is a good choice
-external dependencies use _
-improved performance with _
_ _ are a good choice for task nodes as they can scale; avoid using them for master and core nodes as it may cause _ _

A

Master node
-m5.xlarge is the standard configuration
Core/Task nodes
-m5.xlarge is a good choice
-external dependencies use t2.medium
-improved performance with m4.xlarge
Spot instances are a good choice for task nodes as they can scale; avoid using them for master and core nodes as it may cause data loss

40
Q

Virtual Private Cloud
-A _ is created as a protected network for the cluster

A

VPC

41
Q

EMR Processing Logs
-An _ _ is created to capture the processing logs for the cluster
-_ _ _ are captured in the same bucket

A

S3 bucket
error log files

42
Q

EMR Security
-_ _ grant or deny permissions to control cluster access
-_ _ control access to EMRFS data based on user
-_ _ are attached to IAM roles
-_ provides a secure connection to the command line interface
-_ provides secure user authentication
-The _ _ _ setting prevents public access to data stored on your EMR cluster

A

-IAM policies grant or deny permissions to control cluster access
-IAM roles control access to EMRFS data based on user
-IAM policies are attached to IAM roles
-SSH provides a secure connection to the command line interface
-Kerberos provides secure user authentication
-The Block Public Access setting prevents public access to data stored on your EMR cluster

43
Q

Define an IAM role

A

Assign IAM roles to the cluster entities you create and assign specific permissions that allow trusted identities, such as workforce identities and applications, to perform actions in AWS.

44
Q

Define Security for the Cluster
-Define the _ approach
-Provide the key pair for _ client access to the cluster

A

-Define the encryption approach
-Provide the key pair for SSH client access to the cluster

45
Q

Create the Cluster
-Selecting the _ _ button engages the configuration options and creates the cluster

A

Create cluster

46
Q

Cluster Operations
-The cluster is now available for access by _ _
-Any _ _ selected during cluster creation can now be engaged at the _ _

A

-The cluster is now available for access by SSH clients
-Any Apache application selected during cluster creation can now be engaged at the command line

47
Q

Spark Application
-Apache Spark is a fast and general engine for _ _ _ _
-in-memory _
-optimized _ _
-Spark SQL _
-Machine Learning _
-Spark _

A

-Apache Spark is a fast and general engine for large-scale data processing
-in-memory caching
-optimized query execution
-Spark SQL queries
-Machine Learning MLlib
-Spark Streaming

48
Q

Spark Applications
-Consists of a SparkContext, _ process, and _
-YARN or Spark can be the cluster _

A

-Consists of a SparkContext, driver process, and executors
-YARN or Spark can be the cluster manager

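A minimal PySpark sketch of these pieces, assuming pyspark is available (on the cluster or locally); the app name and sample data are made up. The driver creates the SparkSession/SparkContext, and the cluster manager (YARN on EMR) schedules executors that run the distributed work.

from pyspark.sql import SparkSession

# The driver process creates a SparkSession, which wraps the SparkContext.
spark = (SparkSession.builder
         .appName("flashcard-spark-demo")
         .getOrCreate())

sc = spark.sparkContext  # the SparkContext behind the session

# A small distributed computation executed by the executors
rdd = sc.parallelize(range(1, 1001))
total = rdd.map(lambda x: x * x).sum()
print(total)

# A Spark SQL query over an in-memory DataFrame
df = spark.createDataFrame([(1, "hive"), (2, "spark")], ["id", "name"])
df.createOrReplaceTempView("apps")
spark.sql("SELECT name FROM apps WHERE id = 2").show()

spark.stop()
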
49
Q

EMR Notebook
-AWS notebook (_) backed up to _ data storage
-provision clusters from the _
-accessed via the _ _
-hosted inside a _

A

-AWS notebook (Jupyter) backed up to S3 data storage
-provision clusters from the notebook
-accessed via the AWS console
-hosted inside a VPC

50
-A cluster is a computing network of server nodes
-EMR is the managed Hadoop service provided by AWS
-A distribution is the collection of applications available in the cluster
-Applications like Spark, Hive, and Hadoop are provided by the cluster
-Hive is an application that provides a metadata structure over the Hadoop file system
-Spark is an in-memory high-speed analytical application