AWS Big Data Speciality Flashcards

1
Q

Spark Patterns and Anti Patterns

A

Spark Patterns:

  1. High-performance, fast engine for processing large amounts of data (in-memory, on disk)
  2. Faster than running queries in Hive
  3. Run queries against live data
  4. Flexibility in terms of languages

Spark Anti Patterns:

  1. Not designed for OLTP
  2. Not a fit for batch processing
  3. Avoid large multi-user reporting environments with high concurrency
2
Q

Kinesis Retention Periods

A

24 Hours to 7 Days

Default is 24 Hours

3
Q

EMR Consistent View

A

EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.

You can configure additional settings for consistent view by providing them in
/home/hadoop/conf/emrfs-site.xml

4
Q

DynamoDB Max number of LSI

A

5

5
Q

Kinesis Firehose Handling

A
  1. S3 - retries delivery for up to 24 hours
  2. Redshift & Elasticsearch - retry duration configurable from 0 to 7200 seconds

6
Q

Apache Hadoop Modules

A

Apache Hadoop Modules

  1. Hadoop Common
  2. HDFS
  3. YARN
  4. MapReduce
7
Q

Impala

A

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).

8
Q

Kinesis Consumers

A

Read data from streams:

  1. for further processing
  2. data store delivery
9
Q

Kinesis Streams

A

Kinesis Streams:

  • Receive data from the Producers
  • Replicate data over multiple availability zones for durability
  • Distribute data among the provisioned shards
10
Q

EMR Data Compression Formats

A

Algorithm / Splittable / Compression ratio / Compress-decompress speed

  1. GZIP/No/High/Medium
  2. bzip2/Yes/Very High/Slow
  3. LZO/Yes/Low/Fast
  4. Snappy/No/Low/Very Fast
11
Q

Presto - Patterns and Anti-Patterns

A

Presto Patterns:

  1. Query different types of data sources - relational databases, NoSQL, the Hive framework, Kafka stream processing
  2. High concurrency
  3. In-memory processing

Presto Anti-patterns:

  1. Not fit for Batch Processing
  2. Not designed for OLTP
  3. Not fit for large join operations
12
Q

KPL - Key Concepts

A
  • Include library and use
  • Can write to multiple Amazon Kinesis streams
  • Error recovery built-in: Retry mechanisms
  • Synchronous and asynchronous writing
  • Multithreading
  • Complement to the Amazon Kinesis Client Library (KCL)
  • CloudWatch Integration –Records In/Out/Error
  • Batches data records to increase payload size and improve throughput
  • Aggregation – multiple data records sent in one transaction, increasing the number of records sent per API call
  • Collection – takes multiple aggregated records from the previous step and sends them as one HTTP request; further optimizing the data transfer by reducing HTTP request overhead
13
Q

Resizing EMR Cluster

A
  • Only task nodes can be resized up or down
  • Only one master, cannot change that
  • Core nodes can only be added
  • Even with EMRFS, core nodes have HDFS for processing
  • Add task nodes, task node groups when more processing is needed
14
Q

Redshift Important Operations

A

Redshift important operations:

  1. Launch
  2. Resize
  3. Vacuum
  4. Backup & Restore
  5. Monitoring
15
Q

DynamoDB Performance Metrics

A

1 Partition = 10 GB = 3000 RCU & 1000 WCU

RCU - one strongly consistent 4 KB read/sec
WCU - one 1 KB write/sec
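The capacity-unit math above can be sketched in Python (a minimal sketch; function names are illustrative, and strongly consistent reads are assumed):

```python
import math

def read_capacity_units(item_size_kb: float, reads_per_sec: int) -> int:
    """One RCU covers one strongly consistent read/sec of an item up to 4 KB."""
    return math.ceil(item_size_kb / 4) * reads_per_sec

def write_capacity_units(item_size_kb: float, writes_per_sec: int) -> int:
    """One WCU covers one write/sec of an item up to 1 KB."""
    return math.ceil(item_size_kb / 1) * writes_per_sec

# e.g. 6 KB items read 10 times/sec: ceil(6/4) = 2 RCU per read -> 20 RCU
```

Item sizes are rounded up to the next capacity-unit boundary before multiplying by the request rate.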

16
Q

DynamoDB Streams Configuration Views

A
  1. KEYS_ONLY
  2. NEW_IMAGE
  3. OLD_IMAGE
  4. NEW_AND_OLD_IMAGES
17
Q

KPL Use Cases

A
  • High-rate producers
  • Record aggregation

18
Q

Zookeeper

A

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing

19
Q

Regression Model

A
  • To predict a numerical value
  • RMSE measures the quality of the model
  • Lower RMSE means better predictions
  • RMSE - Root Mean Square Error

Use Cases

  1. What is your house worth?
  2. How many units of this product will sell?
20
Q

Kinesis Agent

A
  1. Real-time Kinesis file mediation client written in Java
  2. Streams files/tails files
  3. Handles file rotation, checkpointing, and retry upon failure
  4. Multiple folders/files to multiple streams
  5. Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
  6. CloudWatch- BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
21
Q

Kinesis Firehose Destination Data Delivery

A
  1. S3
  2. Elasticsearch
  3. Redshift
22
Q

Machine Learning Algorithms

A
  1. Supervised Learning - Trained
    a. Classification - Is this transaction fraud?
    b. Regression - Customer lifetime value
  2. Unsupervised Learning - Self Learning
    a. Clustering - Market Segmentation
23
Q

EMR Cluster sizing

A
  1. Master node -
    m3.xlarge for clusters of < 50 nodes; m3.2xlarge for > 50 nodes
  2. Core nodes -
    Default replication factor:
    10+ node cluster - 3
    4-9 node cluster - 2
    1-3 node cluster - 1

HDFS capacity formula:

Data size = Total storage / Replication factor

Note: AWS recommends smaller cluster of larger nodes
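The capacity formula and the default replication factors above can be sketched together (a minimal sketch; function names are illustrative):

```python
def default_replication_factor(core_nodes: int) -> int:
    """EMR defaults: 3 for 10+ core nodes, 2 for 4-9, 1 for 1-3."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes: int, storage_per_node_gb: float,
                     replication_factor: int) -> float:
    """Usable HDFS data size = total raw storage / replication factor."""
    return core_nodes * storage_per_node_gb / replication_factor
```

For example, 10 core nodes with 800 GB each and replication factor 3 yield roughly 2.6 TB of usable HDFS capacity.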

24
Q

DynamoDB Performance

A

DynamoDB Performance

  1. Partitions by throughput = Desired RCU/3000 + Desired WCU/1000
  2. Partitions by size = Data size in GB/10 GB
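The two formulas above can be combined in a short sketch (a hedged sketch, assuming the effective partition count is the larger of the two estimates, as the DynamoDB documentation of this era described; the function name is illustrative):

```python
import math

def dynamodb_partitions(rcu: int, wcu: int, data_size_gb: float) -> int:
    """Estimate partition count from throughput and from data size."""
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(data_size_gb / 10)
    return max(by_throughput, by_size)

# e.g. 6000 RCU + 2000 WCU over 50 GB:
# throughput -> ceil(2 + 2) = 4, size -> ceil(5) = 5, so 5 partitions
```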
25
Quicksight Components
1. Data Set 2. SPICE - Super-fast, Parallel, In-memory Calculation Engine - capacity measured in GB - 10 GB/user
26
Kinesis Streams Important Points
1. Data can be emitted to S3, DynamoDB, Elasticsearch, and Redshift using the KCL
2. Lambda functions can automatically read records from a Kinesis stream, process them, and send them to S3, DynamoDB, or Redshift
27
Difference between Kafka and Kinesis
In a nutshell, Kafka is a better option if:
- You have the in-house knowledge to maintain Kafka and ZooKeeper
- You need to process more than 1000s of events/s
- You don't want to integrate it with AWS services
Kinesis works best if:
- You don't have the in-house knowledge to maintain Kafka
- You process 1000s of events/s at most
- You stream data into S3 or Redshift
- You don't want to build a Kappa architecture
- Max payload size is 1 MB
28
Big Data Visualization
A. Web-based notebooks: 1. Zeppelin 2. Jupyter Notebook (IPython) B. D3.js - Data-Driven Documents
29
IoT Limits
- Max 300 MQTT CONNECT requests per second
- Max 9000 publish requests per second (3000 inbound, 6000 outbound)
- Client connection payload limit 512 KB/s
- Shadows deleted after 1 year if not updated or retrieved
AWS IoT Rules:
- Max 1000 rules per AWS account
- Max 10 actions per rule
30
Getting data into Kinesis - Third Party Support
- Log4J Appender - Flume - Fluentd
31
Kinesis Producer Library (KPL)
- API - multiple streams - multithread (for multicore) - synchronous and asynchronous - complement to KCL (kinesis client library) - cloudwatch - records in/out/error
32
Methods to load data into Firehose
1. Kinesis Agent | 2. AWS SDK
33
Hue
Open source web interface for analyzing data on EMR
- Amazon S3 and HDFS browser
- Hive/Pig
- Oozie
- Metastore Manager
- Job browser and user management
34
Firehose Data Transformation
With the Firehose data transformation feature, you can specify a Lambda function that performs transformations directly on the stream when you create a delivery stream. When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function with each buffered batch asynchronously. To get you started, AWS provides the following Lambda blueprints, which you can adapt to suit your needs:
- Apache Log to JSON
- Apache Log to CSV
- Syslog to JSON
- Syslog to CSV
- General Firehose Processing
35
EMR HDFS Parameters
1. Replication factor - 3 times
2. Block size: 64 MB - 256 MB
3. Replication factor can be configured in hdfs-site.xml
4. Block size and replication factor are set per file
36
ES Stability
3 master nodes
37
Redshift Data Loading - Data Format
1. CSV 2. Delimited 3. Fixed Width 4. JSON 5. Avro
38
Tracking Amazon Kinesis Streams Application State
For each Amazon Kinesis Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application's state. If your Amazon Kinesis Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. For example, if your Amazon Kinesis Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
40
Redshift : CloudHSM Vs. KMS - Security
CloudHSM
1. ~$16K/year + $5K upfront
2. You must set up HA & durability yourself
3. Single tenant
4. Customer-managed root of trust
5. Symmetric & asymmetric encryption
6. International Common Criteria EAL4 and U.S. Government NIST FIPS 140-2
KMS
1. Usage-based pricing
2. Highly available & durable
3. Multi-tenant
4. AWS-managed root of trust
5. Symmetric encryption only
6. Auditing
41
Machine Learning Summary
1. Binary - AUC (Area Under the Curve) - true positives, true negatives, false positives, and false negatives; the only model type that can be fine-tuned by adjusting the score threshold
2. Multiclass - F1 score; confusion matrix (correct predictions & incorrect predictions)
3. Regression - RMSE (Root Mean Square Error) - the lower the RMSE, the better the predictions
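The regression metric in the summary above is a one-line computation; a minimal sketch:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: lower values mean better regression predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# A perfect model has RMSE 0; larger errors are penalized quadratically.
```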
42
Kinesis Streams - Load/Get Data Options
1. Kinesis Producer Library - Producers 2. Kinesis Client Library - KCL 3. Kinesis Agent 4. Kinesis REST API
43
Redshift - Vacuum - Best Practices
1. Vacuum is I/O intensive
2. Perform Vacuum after bulk deletes, data loading, or updates
3. Perform Vacuum during periods of lower activity or during your maintenance windows
4. The Vacuum utility is not recommended for tables over 700 GB
5. To avoid the need for Vacuum: load data in sort key order; use time-series tables
44
WLM - Type of Groups
1. User Group | 2. Query Group
45
Redshift Table Design - Constraints
Maintain data integrity. Types of constraints: 1. Primary key 2. Unique 3. Not null/null 4. References 5. Foreign key. Except for NOT NULL/NULL, Redshift does not enforce these constraints; they are informational only
46
Hunk
Hunk is a web-based interactive data analytics platform for rapidly exploring, analysing and visualizing data in Hadoop and NoSQL data stores
47
Types of Analysis
1. Pre-processing: filtering, transformations 2. Basic Analytics: Simple counts, aggregates over windows 3. Advanced Analytics: Detecting anomalies, event correlation 4. Post-processing: Alerting, triggering, final filters
48
Jupyter Notebook
Jupyter is a web-based notebook for running Python, R, Scala and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models
49
Kinesis Streams - Kinesis Connectors available for
DynamoDB, S3, Elasticsearch, Redshift
50
Redshift Important System Tables
1. STL_LOAD_ERRORS | 2. STL_LOADERROR_DETAIL
51
Apache Ranger
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
52
Redshift - Encryption in Transit
1. Create a parameter group 2. SSL certificate
53
Machine Learning Use Cases
1. Fraud Detection 2. Customer Service 3. Litigation/Legal 4. Security 5. Healthcare 6. Sports
54
Kinesis Firehose Important Parameters
Buffer size: 1 MB - 128 MB
Buffer interval: 60 - 900 seconds
Parameters for transformation: 1. recordId 2. result: Ok, Dropped & ProcessingFailed 3. data
55
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
56
Redshift - Encryption at Rest
1. KMS 2. HSM (CloudHSM & on-prem HSM) Note: Redshift encrypts data blocks, system metadata & snapshots
57
Tez
- Tez is an engine to process complex Directed Acyclic Graph (DAG) - It can be used in place of Hadoop for running Pig and Hive - Runs on top of YARN
58
SQS vs Kinesis Streams
SQS - Message Queue | Kinesis Streams - Real time processing
59
DynamoDB Performance Points
1. Use GSIs 2. Use burst capacity/spread periodic batch writes/an SQS-managed write buffer - in case of uneven writes 3. Use caching - in case of uneven reads
60
Redshift features
1. Petabyte-scale data warehouse service 2. OLAP & BI use cases 3. ANSI SQL compliant 4. Column-oriented 5. MPP architecture 6. Node types: a. Dense Compute (DC1 and DC2) b. Dense Storage (DS2) 7. Single-AZ implementation
61
Machine Learning Limits
Max observation size (target + attributes): 100 KB
Max training data size: 100 GB
Max batch prediction input size: 1 TB
Max batch prediction input records: 100 million
Max columns in schema: 1000
Real-time prediction endpoint TPS: 200
Number of classes for multiclass ML models: 100
62
Kinesis: key features
Kinesis key features: 1. Real-time data streaming 2. Ordered record delivery 3. Replication across three availability zones 4. De-coupled from consuming applications 5. Replay data 6. Zero-downtime scaling 7. Pay as you go 8. Parallel processing - multiple producers and consumers
63
EMR - S3DistCP
In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and adds several useful features.
1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster
64
Spark Components
1. Spark Core - dispatching & scheduling tasks 2. Spark SQL - execute low-latency interactive SQL queries against structured data 3. Spark Streaming - stream processing of live data streams 4. MLlib - scalable machine learning library 5. GraphX - graph-parallel computation
65
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tuneable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
66
Redshift Cluster Resizing
1. Creation of the destination (target) cluster 2. The source cluster restarts and enters read-only mode 3. Reconnect to the source cluster to run queries in read-only mode 4. Redshift copies data from the source to the target cluster 5. Once the copy is complete, Redshift updates the DNS endpoint to point to the target cluster 6. The source cluster is decommissioned
67
SDK Use Cases
Low-rate producers: mobile apps, IoT devices, web clients
68
Redshift Table Design - Distribution Style
1. Even
- Rows are distributed across the slices regardless of the values in any particular column
- Default distribution style
2. Key
- Distributes data among slices by the values in the distribution key column
- Collocates matching rows on the same slice
- Improves join performance
- Use cases: joined tables, larger fact tables
3. All
- A copy of the entire table is stored on every node
- Needs more space due to duplication
- Use cases: static data, small tables, no common distribution key
69
Redshift - Slices Guidelines
The number of data files should equal the number of slices, or a multiple of it - i.e. 4 slices = 4 or 8 files; 32 slices = 32 or 64 files. File compression: gzip, lzop, bzip2
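The file-count guideline above amounts to a divisibility check (a minimal sketch; the function name is illustrative):

```python
def is_balanced_file_count(num_files: int, num_slices: int) -> bool:
    """COPY parallelizes evenly when the file count is a multiple of the slice count."""
    return num_files > 0 and num_files % num_slices == 0

# 4 slices: 4 or 8 files keep every slice busy; 6 files leave some slices idle
```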
70
Glacier - Vault Lock Policy
Amazon Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Amazon Glacier vaults with a vault lock policy. Use Cases 1. Time based retention 2. Undeletable
71
Kinesis Analytics Use Cases
- Mobile app live monitoring - Clickstream analytics - Logs - Metering records - IoT data
72
EMR - Resizing Cluster
1. Manually - options: a. terminate at instance hour b. terminate at task completion 2. Autoscaling
73
How to decide between more nodes vs. bigger nodes generally - Amazon Redshift ?
Bigger nodes are better for long-running queries, e.g. fewer dc1.8xlarge nodes are better than more dc1.large nodes. More nodes are better for short-running queries, e.g. more dc1.large nodes are better than fewer dc1.8xlarge nodes
74
WLM Settings
1. A reboot of the Redshift cluster is required to reflect changes to: user group, query group, user group wildcard, query group wildcard
2. No reboot required for these parameters: concurrency, % of memory, timeout
75
What are the different types of KPL Batching?
1. Collection - groups stream records and batches them to reduce HTTP requests 2. Aggregation - combines multiple user records into a single stream record
76
Data lake
S3
77
Kinesis Stream - KPL Anti-pattern
When the producer application/use case can't tolerate the additional processing delay that KPL batching introduces
78
DynamoDB - Integration with AWS Services
1. Redshift - COPY command to transfer the data 2. EMR - Hive can read & write data from DynamoDB 3. S3 - export & import to S3 4. Data Pipeline - mediator to copy the data to and from S3 5. Lambda - event-based actions 6. Kinesis Streams - streaming data 7. EC2 instances - streaming data
79
HIVE Patterns
1. Process and analyse logs 2. Join very large tables 3. Batch jobs 4. Ad-hoc interactive queries
80
Difference between Supervised and Unsupervised Learning
Unsupervised learning:
1. Unlabeled data
2. No knowledge of the output
3. Self-guided learning algorithm
4. Aim: to discover patterns and groupings in the data
Supervised learning:
1. Labeled data
2. Desired outcome is known
3. The algorithm is given training data to learn from
4. Aim: predictive analytics
81
Sqoop
Sqoop is a tool for data migration between Amazon S3, Hadoop, HDFS, and RDBMS databases including Redshift - Parallel data transfer for faster export and ingestion - Batched transfer, not meant for interactive queries
82
Redshift Data Model
1. Star schema - consists of one or more fact tables referencing any number of dimension tables - Fact table - consists of the measurements/metrics of a business process - Dimension table - stores dimensions that describe objects in the fact table
83
EMR - Data at Rest Encryption
1. EC2 cluster nodes: a. open source HDFS encryption b. LUKS encryption 2. For EMRFS on S3: a. SSE-S3 b. SSE-KMS c. CSE-KMS d. CSE-Custom
84
Quicksight Visualizations
- 20 visuals per analysis - QuickSight can determine the most appropriate visual type for you - Dimensions & measures (fields)
85
Lambda Patterns
Lambda Patterns - Real-time file processing - Real-time stream processing - Extract, transform, and load - Replace cron - Process AWS events
86
SQS Features
- 256 KB messages - Messages can be retained for up to 14 days - Two important architectures: 1. SQS priority architecture 2. Fanout architecture
87
EMR Security
Controls: 1. Security groups: a. default & b. EMR-managed 2. IAM roles - default role, EC2 default role & autoscaling default role 3. Private subnet 4. Encryption at rest 5. Encryption in transit
88
EMR Anti-Patterns
Small data sets – Amazon EMR is built for massively parallel processing; if your data set is small enough to run quickly on a single machine, in a single thread, the added overhead of map and reduce jobs may not be worth it. ACID transaction requirements – While there are ways to achieve ACID (atomicity, consistency, isolation, durability) or limited ACID on Hadoop, another database, such as Amazon RDS or a relational database running on Amazon EC2, may be a better option for workloads with stringent requirements.
89
Kinesis Streams Features
Kinesis Streams Features: - Streams receive data from the Producers - Replicate data over multiple availability zones for durability - Distribute data among the provisioned shards
91
EMR File Formats
EMR file formats: 1. Text 2. Parquet 3. ORC 4. Sequence 5. Avro. Keep GZIP files in the 1-2 GB range. Avoid smaller files (<100 MB). S3DistCp can be used to copy data between S3 and HDFS or vice versa
92
IoT Authentication
IoT Authentication: 1. X.509 Certificate 2. Cognito Identity
93
EMR Storage Options
1. Instance store 2. EBS for HDFS 3. EMRFS - S3. EMRFS & HDFS can be used together. Copy data from S3 to HDFS using S3DistCp
94
Data Pipeline Components
1. Data Nodes 2. Activities 3. Preconditions 4. Schedules
95
Redshift Table Design - Compression
1. Automatic - recommended by AWS 2. Manual - use ENCODE to specify column compression
96
Kinesis Streams - Best Practices
- Start off with multiple shards - Have multiple consumers for A/B testing without downtime - Dump data to S3 when possible; it’s cheap and durable - Use the same stream for data archival and analytics - Lambda for transformations and processing - Use logic in consumer if you need only-once delivery; keep state in DynamoDB - Tag streams for cost segregation
97
Redshift Deep Copy
A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table has a large unsorted region, a deep copy is much faster than a vacuum. The trade-off is that you cannot make concurrent updates during a deep copy operation, which you can do during a vacuum. Options: 1. Perform a deep copy using the original table DDL 2. Perform a deep copy using CREATE TABLE LIKE 3. Perform a deep copy by creating a temporary table and truncating the original table Note: the 1st method is preferred over the others
98
Redshift - Encryption Keys Hierarchy
1. Master Key 2. Cluster Encryption Key 3. Database Encryption Key 4. Data Encryption Key
99
Redshift Data Loading - Manifest
1. Load required files only 2. Load files from different bucket 3. Load files with different prefix 4. JSON format
100
Spark on EMR
1. Spark framework replaces MapReduce framework 2. Spark processing engine will be deployed in each node of cluster 3. Spark SQL can interact with S3 or HDFS
101
Redshift WLM Features
Redshift WLM Features 1. Manages separate queue for long running and short running queries 2. Configure memory allocation to queues 3. Improve performance & expenses
102
Which data Ingestion Tool is similar to Kinesis?
Kafka
103
Kinesis - Producers
- Producers add data records to Kinesis streams - A data record must contain: 1. Name of the stream 2. Partition key 3. Data content - A single data record can be added using the PutRecord API - Multiple data records can be added at one time using the PutRecords API
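A producer using PutRecords typically batches its records first, since PutRecords accepts at most 500 records per request. A minimal, hedged sketch of that batching step (pure Python; the helper name is illustrative and no AWS call is made):

```python
def batch_records(records, max_per_call=500):
    """Split a record list into chunks sized for single PutRecords calls.

    PutRecords accepts up to 500 records per request, so a producer
    iterates over these chunks, issuing one API call per chunk.
    """
    return [records[i:i + max_per_call]
            for i in range(0, len(records), max_per_call)]
```

For example, 1200 records would be sent as three PutRecords calls of 500, 500, and 200 records.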
104
KCL - Features
- Consumes and processes data from an Amazon Kinesis stream
- KCL libraries available for Java, Ruby, Node, Go, and a multi-lang implementation with native Python support
- Creates a DynamoDB table (with the same name as your application) to manage state
- Make sure you don't have name conflicts between an existing DynamoDB table and your app name (same region)
- Multiple KCLs can seamlessly work on the same or different streams
- Checkpoints processed records
- KCL workers can load balance among each other
- Automatically deals with stream scaling, like shard splits and merges
- Emits key performance indicators, like records processed (size, count, latency, age)
- MillisBehindLatest - how far behind the KCL is
105
What is Mahout?
Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most similar items or build item recommendations for users
106
Redshift Important Performance Metrics
Redshift Important Performance Metrics: 1. Number of nodes, processors or slices 2. Node Types 3. Data Distribution 4. Data Sort Order 5. Dataset size 6. Concurrent Operations 7. Query Structure 8. Code Compilation
107
EMR Important Web Interfaces
YARN ResourceManager - http://master-public-dns-name:8088/
YARN NodeManager - http://slave-public-dns-name:8042/
Hadoop HDFS NameNode - http://master-public-dns-name:50070/
Hadoop HDFS DataNode - http://slave-public-dns-name:50075/
Spark HistoryServer - http://master-public-dns-name:18080/
108
What are the differences between LSI and GSI?
Global secondary index (GSI) - an index with a partition key and a sort key that can be different from those on the base table.
- A global secondary index is considered "global" because queries on the index can span all of the data in the base table, across all partitions.
- It can be created any time after table creation
- Does not share RCU & WCU with the table
Local secondary index (LSI) - an index that has the same partition key as the base table, but a different sort key.
- A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value.
- It can only be created at table creation
- Shares RCU & WCU with the table
109
Kinesis Streams - Shard Capacity
Kinesis Streams - Shard Capacity: - 1 MB/sec Data Input - 2 MB/sec Data Output - 5 transactions/sec for read - 1000 records/sec for writes
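The per-shard limits above determine how many shards a stream needs; a minimal sketch, assuming the stream must satisfy whichever limit is tightest (the function name is illustrative):

```python
import math

def shards_needed(ingress_mb_per_sec: float,
                  records_per_sec: float,
                  egress_mb_per_sec: float) -> int:
    """Each shard supports 1 MB/s in, 1000 records/s in, and 2 MB/s out."""
    return max(
        math.ceil(ingress_mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000),
        math.ceil(egress_mb_per_sec / 2.0),
        1,  # a stream always has at least one shard
    )

# e.g. 5 MB/s in, 3000 records/s, 8 MB/s out -> max(5, 3, 4) = 5 shards
```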
110
Which API is used to add: 1) Single Data Records? 2) Multiple Data Records?
1) PutRecord for single data records | 2) PutRecords for multiple data records
111
What is Apache Ranger?
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
112
Redshift unloading data - Encryption Option
1. SSE-S3 2. SSE-KMS 3. CSE-CMK
113
What is Kinesis Analytics?
1. Amazon Kinesis Analytics enables users to run standard SQL queries over live streaming data 2. Readily query Kinesis Streams/Firehose data and export the output to destinations like S3
114
EMR Use Cases
1. Log Processing/Analytics 2. ETL 3. Clickstream Analytics 4. Machine Learning
115
What is Spark RDD?
Resilient Distributed Dataset (RDD) is the core abstraction of Spark - a fault-tolerant collection of elements that can be operated on in parallel
116
Binary Classification Model
- To predict a binary outcome - AUC (Area Under the Curve) measures the prediction accuracy of the model (0 to 1) - Important parameters: histogram, cut-off threshold
117
Kinesis Producer Library (KPL)
- API - Multiple streams - Multithread (for multicore) - Synchronous and asynchronous - Complement to KCL (kinesis client library) - Cloudwatch - records in/out/error
118
Multiclass Classification Model
- To generate predictions for multiple classes - F1 score measures quality of a model (0 to 1) - Confusion Matrix is used
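The F1 score named above is the harmonic mean of precision and recall, computed from confusion-matrix counts; a minimal sketch:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall (0 to 1; higher is better)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 2 false negatives:
# precision = 0.8, recall = 0.8, F1 = 0.8
```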
119
HIVE on EMR Integration
S3 | DynamoDB
120
What is Elasticsearch?
- It is a distributed, multitenant-capable full-text search engine - HTTP web interface - It can be integrated with Logstash & Kibana (the ELK stack): 1. Logstash - data collection & log-parsing engine 2. Kibana - open source data visualization and exploration tool
121
EMR - Long Running vs Transient Cluster
Long Running - 1. Cluster stays up & running for queries against HBASE 2. Jobs on the cluster run frequently Transient Cluster - 1. Temporary cluster that shuts down after processing 2. Good use case is Batch Job
122
Quicksight Visualization Types
1. AutoGraph 2. Bar chart - vertical & horizontal 3. Line charts - gross sales by month - gross sales and net sales by month - measure of a dimension over a period of time 4. Pivot table - a way to summarize data 5. Scatter plot - two or three measures of a dimension 6. Tree map - one to two measures for a dimension 7. Pie chart - compare values for different dimensions 8. Heat map - identify trends & outliers 9. Story - create a narrative by presenting iterations of an analysis 10. Dashboard - read-only snapshot of an analysis
123
Redshift - Cross Region Snapshots
Cross-region KMS-encrypted snapshots are supported for KMS-encrypted clusters 1. The copied snapshot remains encrypted
124
Redshift Table Design - Key Factors
1. Architecture 2. Distribution Styles 3. Sort Keys 4. Compression 5. Constraints 6. Column Sizing 7. Data Types
125
What is HCatalog?
HCatalog is a table storage manager for Hadoop. It can store data in any format and make it available to external systems like Hive and Pig. It can write files in many formats, like RCFile, CSV, JSON, SequenceFile, and ORC, or custom formats
126
What is Redshift Vacuum?
Vacuum helps recover space and sort the table. Vacuum options: FULL, SORT ONLY, DELETE ONLY. Note: when a row is updated or deleted, Redshift does not automatically free up the space
127
What is Pig?
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data.
128
Elasticsearch Use Cases
1. Logging & analysis 2. Distributed document store 3. Real-time application monitoring 4. Clickstream weblog ingestion
129
Redshift Table - Sort Keys
1. Single 2. Compound 3. Interleaved
130
What is Zeppelin?
- Zeppelin is a web-based notebook that enables interactive data analytics
- Ingestion, discovery, analytics, visualization, and collaboration
- Connectors for HDFS, HBase, Hive, Spark, Flink, PostgreSQL, Redshift, and Elasticsearch