AWS Big Data Speciality Flashcards

1
Q

Spark Patterns and Anti Patterns

A

Spark Patterns:

  1. High-performance, fast engine for processing large amounts of data (in-memory, on disk)
  2. Faster than running queries in Hive
  3. Run queries against live data
  4. Flexibility in terms of languages

Spark Anti Patterns:

  1. Not designed for OLTP
  2. Not a fit for batch processing
  3. Avoid large multi-user reporting environments with high concurrency
2
Q

Kinesis Retention Periods

A

24 Hours to 7 Days

Default is 24 Hours

3
Q

EMR Consistent View

A

EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.

You can configure additional settings for consistent view by providing them in
/home/hadoop/conf/emrfs-site.xml

4
Q

DynamoDB Max number of LSI

A

5

5
Q

Kinesis Firehose Handling

A
  1. S3 - retries delivery for up to 24 hours
  2. Redshift & Elasticsearch - retry duration configurable from 0 to 7200 seconds

6
Q

Apache Hadoop Modules

A

Apache Hadoop Modules

  1. Hadoop Common
  2. HDFS
  3. YARN
  4. MapReduce
7
Q

Impala

A

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).

8
Q

Kinesis Consumers

A

Read data from streams:

  1. for further processing
  2. data store delivery
9
Q

Kinesis Streams

A

Kinesis Streams:

  • Receive data from the Producers
  • Replicate data over multiple availability zones for durability
  • Distribute data among the provisioned shards
10
Q

EMR Data Compression Formats

A

Algorithm / Splittable / Compression ratio / Compress-decompress speed

  1. GZIP/No/High/Medium
  2. bzip2/Yes/Very High/Slow
  3. LZO/Yes/Low/Fast
  4. Snappy/No/Low/Very Fast
11
Q

Presto - Patterns and Anti-Patterns

A

Presto Patterns:

  1. Query different types of data sources - relational databases, NoSQL, the Hive framework, Kafka stream processing
  2. High concurrency
  3. In-memory processing

Presto Anti-patterns:

  1. Not fit for Batch Processing
  2. Not designed for OLTP
  3. Not fit for large join operations
12
Q

KPL - Key Concepts

A
  • Include library and use
  • Can write to multiple Amazon Kinesis streams
  • Error recovery built-in: Retry mechanisms
  • Synchronous and asynchronous writing
  • Multithreading
  • Complement to the Amazon Kinesis Client Library (KCL)
  • CloudWatch Integration –Records In/Out/Error
  • Batches data records to increase payload size and improve throughput
  • Aggregation – multiple data records sent in one transaction, increasing the number of records sent per API call
  • Collection – takes multiple aggregated records from the previous step and sends them as one HTTP request; further optimizing the data transfer by reducing HTTP request overhead
13
Q

Resizing EMR Cluster

A
  • Only task nodes can be resized up or down
  • Only one master, cannot change that
  • Core nodes can only be added
  • Even with EMRFS, core nodes have HDFS for processing
  • Add task nodes, task node groups when more processing is needed
14
Q

Redshift Important Operations

A

Redshift important operations:

  1. Launch
  2. Resize
  3. Vacuum
  4. Backup & Restore
  5. Monitoring
15
Q

DynamoDB Performance Metrics

A

1 Partition = 10 GB = 3000 RCU & 1000 WCU

RCU - one strongly consistent 4 KB read/sec
WCU - one 1 KB write/sec
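The capacity-unit math above can be sketched in Python (a minimal sketch; function names are illustrative, and strongly consistent reads are assumed):

```python
import math

def read_capacity_units(item_size_kb: float, reads_per_sec: int) -> int:
    """One RCU covers one strongly consistent read/sec of an item up to 4 KB."""
    return math.ceil(item_size_kb / 4) * reads_per_sec

def write_capacity_units(item_size_kb: float, writes_per_sec: int) -> int:
    """One WCU covers one write/sec of an item up to 1 KB."""
    return math.ceil(item_size_kb / 1) * writes_per_sec

# e.g. 6 KB items read 10 times/sec: ceil(6/4) = 2 RCU per read -> 20 RCU
```

Item sizes are rounded up to the next capacity-unit boundary before multiplying by the request rate.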

16
Q

DynamoDB Streams Configuration Views

A
  1. KEYS_ONLY
  2. NEW_IMAGE
  3. OLD_IMAGE
  4. NEW_AND_OLD_IMAGES
17
Q

KPL Use Cases

A
  • High-rate producers
  • Record aggregation

18
Q

Zookeeper

A

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing

19
Q

Regression Model

A
  • To predict a numerical value
  • RMSE measures the quality of the model
  • Lower RMSE means better predictions
  • RMSE - Root Mean Square Error

Use Cases

  1. What is your house worth?
  2. How many units of this product will sell?
20
Q

Kinesis Agent

A
  1. Real-time Kinesis file mediation client written in Java
  2. Streams files/tails files
  3. Handles file rotation, checkpointing, and retry upon failure
  4. Multiple folders/files to multiple streams
  5. Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
  6. CloudWatch- BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
21
Q

Kinesis Firehose Destination Data Delivery

A
  1. S3
  2. Elasticsearch
  3. Redshift
22
Q

Machine Learning Algorithms

A
  1. Supervised Learning - Trained
    a. Classification - Is this transaction fraud?
    b. Regression - Customer lifetime value
  2. Unsupervised Learning - Self Learning
    a. Clustering - Market Segmentation
23
Q

EMR Cluster sizing

A
  1. Master node -
    m3.xlarge for clusters of < 50 nodes; m3.2xlarge for > 50 nodes
  2. Core nodes -
    Default replication factor:
    10+ node cluster - 3
    4-9 node cluster - 2
    1-3 node cluster - 1

HDFS capacity formula:

Data size = Total storage / Replication factor

Note: AWS recommends smaller cluster of larger nodes
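The capacity formula and the default replication factors above can be sketched together (a minimal sketch; function names are illustrative):

```python
def default_replication_factor(core_nodes: int) -> int:
    """EMR defaults: 3 for 10+ core nodes, 2 for 4-9, 1 for 1-3."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1

def hdfs_capacity_gb(core_nodes: int, storage_per_node_gb: float,
                     replication_factor: int) -> float:
    """Usable HDFS data size = total raw storage / replication factor."""
    return core_nodes * storage_per_node_gb / replication_factor
```

For example, 10 core nodes with 800 GB each and replication factor 3 yield roughly 2.6 TB of usable HDFS capacity.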

24
Q

DynamoDB Performance

A

DynamoDB Performance

  1. Partitions by throughput = Desired RCU/3000 + Desired WCU/1000
  2. Partitions by size = Data size in GB/10 GB
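The two formulas above can be combined in a short sketch (a hedged sketch, assuming the effective partition count is the larger of the two estimates, as the DynamoDB documentation of this era described; the function name is illustrative):

```python
import math

def dynamodb_partitions(rcu: int, wcu: int, data_size_gb: float) -> int:
    """Estimate partition count from throughput and from data size."""
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_size = math.ceil(data_size_gb / 10)
    return max(by_throughput, by_size)

# e.g. 6000 RCU + 2000 WCU over 50 GB:
# throughput -> ceil(2 + 2) = 4, size -> ceil(5) = 5, so 5 partitions
```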
25
Quicksight Components
1. Data Set 2. SPICE - Super-fast, Parallel, In-memory Calculation Engine - capacity measured in GB - 10 GB/user
26
Kinesis Streams Important Points
1. Data can be emitted to S3, DynamoDB, Elasticsearch, and Redshift using the KCL
2. Lambda functions can automatically read records from a Kinesis stream, process them, and send them to S3, DynamoDB, or Redshift
27
Difference between Kafka and Kinesis
In a nutshell, Kafka is a better option if:
- You have the in-house knowledge to maintain Kafka and ZooKeeper
- You need to process more than 1000s of events/s
- You don't want to integrate it with AWS services
Kinesis works best if:
- You don't have the in-house knowledge to maintain Kafka
- You process 1000s of events/s at most
- You stream data into S3 or Redshift
- You don't want to build a Kappa architecture
- Max payload size is 1 MB
28
Big Data Visualization
A. Web-based notebooks: 1. Zeppelin 2. Jupyter Notebook (IPython) B. D3.js - Data-Driven Documents
29
IoT Limits
- Max 300 MQTT CONNECT requests per second
- Max 9000 publish requests per second (3000 inbound, 6000 outbound)
- Client connection payload limit 512 KB/s
- Shadows deleted after 1 year if not updated or retrieved
AWS IoT Rules:
- Max 1000 rules per AWS account
- Max 10 actions per rule
30
Getting data into Kinesis - Third Party Support
- Log4J Appender - Flume - Fluentd
31
Kinesis Producer Library (KPL)
- API - multiple streams - multithread (for multicore) - synchronous and asynchronous - complement to KCL (kinesis client library) - cloudwatch - records in/out/error
32
Methods to load data into Firehose
1. Kinesis Agent | 2. AWS SDK
33
Hue
Open source web interface for analyzing data on EMR
- Amazon S3 and HDFS browser
- Hive/Pig
- Oozie
- Metastore Manager
- Job browser and user management
34
Firehose Data Transformation
With the Firehose data transformation feature, you can specify a Lambda function that performs transformations directly on the stream when you create a delivery stream. When you enable Firehose data transformation, Firehose buffers incoming data and invokes the specified Lambda function with each buffered batch asynchronously. To get you started, AWS provides the following Lambda blueprints, which you can adapt to suit your needs:
- Apache Log to JSON
- Apache Log to CSV
- Syslog to JSON
- Syslog to CSV
- General Firehose Processing
35
EMR HDFS Parameters
1. Replication factor - 3 times
2. Block size: 64 MB - 256 MB
3. Replication factor can be configured in hdfs-site.xml
4. Block size and replication factor are set per file
36
ES Stability
3 master nodes
37
Redshift Data Loading - Data Format
1. CSV 2. Delimited 3. Fixed Width 4. JSON 5. Avro
38
Tracking Amazon Kinesis Streams Application State
For each Amazon Kinesis Streams application, the KCL uses a unique Amazon DynamoDB table to keep track of the application's state. If your Amazon Kinesis Streams application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. For example, if your Amazon Kinesis Streams application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
40
Redshift : CloudHSM Vs. KMS - Security
CloudHSM
1. ~$16K/year + $5K upfront
2. You must set up HA & durability yourself
3. Single tenant
4. Customer-managed root of trust
5. Symmetric & asymmetric encryption
6. International Common Criteria EAL4 and U.S. Government NIST FIPS 140-2
KMS
1. Usage-based pricing
2. Highly available & durable
3. Multi-tenant
4. AWS-managed root of trust
5. Symmetric encryption only
6. Auditing
41
Machine Learning Summary
1. Binary - AUC (Area Under the Curve) - true positives, true negatives, false positives, and false negatives; the only model type that can be fine-tuned by adjusting the score threshold
2. Multiclass - F1 score; confusion matrix (correct predictions & incorrect predictions)
3. Regression - RMSE (Root Mean Square Error) - the lower the RMSE, the better the predictions
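The regression metric in the summary above is a one-line computation; a minimal sketch:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: lower values mean better regression predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# A perfect model has RMSE 0; larger errors are penalized quadratically.
```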
42
Kinesis Streams - Load/Get Data Options
1. Kinesis Producer Library - Producers 2. Kinesis Client Library - KCL 3. Kinesis Agent 4. Kinesis REST API
43
Redshift - Vacuum - Best Practices
1. Vacuum is I/O intensive
2. Perform Vacuum after bulk deletes, data loading, or updates
3. Perform Vacuum during periods of lower activity or during your maintenance windows
4. The Vacuum utility is not recommended for tables over 700 GB
5. To avoid the need for Vacuum: load data in sort key order; use time-series tables
44
WLM - Type of Groups
1. User Group | 2. Query Group
45
Redshift Table Design - Constraints
Maintain data integrity. Types of constraints: 1. Primary key 2. Unique 3. Not null/null 4. References 5. Foreign key. Except for NOT NULL/NULL, Redshift does not enforce these constraints; they are informational only
46
Hunk
Hunk is a web-based interactive data analytics platform for rapidly exploring, analysing and visualizing data in Hadoop and NoSQL data stores
47
Types of Analysis
1. Pre-processing: filtering, transformations 2. Basic Analytics: Simple counts, aggregates over windows 3. Advanced Analytics: Detecting anomalies, event correlation 4. Post-processing: Alerting, triggering, final filters
48
Jupyter Notebook
Jupyter is a web-based notebook for running Python, R, Scala and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models
49
Kinesis Streams - Kinesis Connectors available for
DynamoDB, S3, Elasticsearch, Redshift
50
Redshift Important System Tables
1. STL_LOAD_ERRORS | 2. STL_LOADERROR_DETAIL
51
Apache Ranger
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
52
Redshift - Encryption in Transit
1. Create a parameter group 2. SSL certificate
53
Machine Learning Use Cases
1. Fraud Detection 2. Customer Service 3. Litigation/Legal 4. Security 5. Healthcare 6. Sports
54
Kinesis Firehose Important Parameters
Buffer size: 1 MB - 128 MB
Buffer interval: 60 - 900 seconds
Parameters for transformation: 1. recordId 2. result: Ok, Dropped & ProcessingFailed 3. data
55
Oozie
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
56
Redshift - Encryption at Rest
1. KMS 2. HSM (CloudHSM & on-prem HSM) Note: Redshift encrypts data blocks, system metadata & snapshots
57
Tez
- Tez is an engine to process complex Directed Acyclic Graph (DAG) - It can be used in place of Hadoop for running Pig and Hive - Runs on top of YARN
58
SQS vs Kinesis Streams
SQS - Message Queue | Kinesis Streams - Real time processing
59
DynamoDB Performance Points
1. Use GSIs 2. Use burst capacity/spread periodic batch writes/an SQS-managed write buffer - in case of uneven writes 3. Use caching - in case of uneven reads
60
Redshift features
1. Petabyte-scale data warehouse service 2. OLAP & BI use cases 3. ANSI SQL compliant 4. Column-oriented 5. MPP architecture 6. Node types: a. Dense Compute (DC1 and DC2) b. Dense Storage (DS2) 7. Single-AZ implementation
61
Machine Learning Limits
Max observation size (target + attributes): 100 KB
Max training data size: 100 GB
Max batch prediction input size: 1 TB
Max batch prediction input records: 100 million
Max columns in schema: 1000
Real-time prediction endpoint TPS: 200
Number of classes for multiclass ML models: 100
62
Kinesis: key features
Kinesis key features: 1. Real-time data streaming 2. Ordered record delivery 3. Replication across three availability zones 4. De-coupled from consuming applications 5. Replay data 6. Zero-downtime scaling 7. Pay as you go 8. Parallel processing - multiple producers and consumers
63
EMR - S3DistCP
In the Hadoop ecosystem, DistCp is often used to move data. DistCp provides a distributed copy capability built on top of a MapReduce framework. S3DistCp is an extension to DistCp that is optimized to work with S3 and adds several useful features.
1. Copy or move files without transformation
2. Copy and change file compression on the fly
3. Copy files incrementally
4. Copy multiple folders in one job
5. Aggregate files based on a pattern
6. Upload files larger than 1 TB in size
7. Submit an S3DistCp step to an EMR cluster
64
Spark Components
1. Spark Core - dispatching & scheduling tasks 2. Spark SQL - execute low-latency interactive SQL queries against structured data 3. Spark Streaming - stream processing of live data streams 4. MLlib - scalable machine learning library 5. GraphX - graph-parallel computation
65
Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tuneable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
66
Redshift Cluster Resizing
1. Creation of the destination (target) cluster 2. The source cluster restarts and enters read-only mode 3. Reconnect to the source cluster to run queries in read-only mode 4. Redshift copies data from the source to the target cluster 5. Once the copy is complete, Redshift updates the DNS endpoint to point to the target cluster 6. The source cluster is decommissioned
67
SDK Use Cases
Low-rate producers: mobile apps, IoT devices, web clients
68
Redshift Table Design - Distribution Style
1. Even
- Rows are distributed across the slices regardless of the values in any particular column
- Default distribution style
2. Key
- Distributes data among slices by the values in the distribution key column
- Collocates matching rows on the same slice
- Improves join performance
- Use cases: joined tables, larger fact tables
3. All
- A copy of the entire table is stored on every node
- Needs more space due to duplication
- Use cases: static data, small tables, no common distribution key
69
Redshift - Slices Guidelines
The number of data files should equal the number of slices, or a multiple of it - i.e. 4 slices = 4 or 8 files; 32 slices = 32 or 64 files. File compression: gzip, lzop, bzip2
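The file-count guideline above amounts to a divisibility check (a minimal sketch; the function name is illustrative):

```python
def is_balanced_file_count(num_files: int, num_slices: int) -> bool:
    """COPY parallelizes evenly when the file count is a multiple of the slice count."""
    return num_files > 0 and num_files % num_slices == 0

# 4 slices: 4 or 8 files keep every slice busy; 6 files leave some slices idle
```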
70
Glacier - Vault Lock Policy
Amazon Glacier Vault Lock allows you to easily deploy and enforce compliance controls for individual Amazon Glacier vaults with a vault lock policy. Use Cases 1. Time based retention 2. Undeletable
71
Kinesis Analytics Use Cases
- Mobile app live monitoring - Clickstream analytics - Logs - Metering records - IoT data
72
EMR - Resizing Cluster
1. Manually - options: a. terminate at instance hour b. terminate at task completion 2. Autoscaling
73
How to decide between more nodes vs. bigger nodes generally - Amazon Redshift ?
Bigger nodes are better for long-running queries, e.g. fewer dc1.8xlarge nodes are better than more dc1.large nodes. More nodes are better for short-running queries, e.g. more dc1.large nodes are better than fewer dc1.8xlarge nodes
74
WLM Settings
1. A reboot of the Redshift cluster is required to reflect changes to: user group, query group, user group wildcard, query group wildcard
2. No reboot required for these parameters: concurrency, % of memory, timeout
75
What are the different types of KPL Batching?
1. Collection - groups stream records and batches them to reduce HTTP requests 2. Aggregation - combines multiple user records into a single stream record
76
Data lake
S3
77
Kinesis Stream - KPL Anti-pattern
When the producer application/use case can't tolerate the additional processing delay that KPL batching introduces
78
DynamoDB - Integration with AWS Services
1. Redshift - COPY command to transfer the data 2. EMR - Hive can read & write data from DynamoDB 3. S3 - export & import to S3 4. Data Pipeline - mediator to copy the data to and from S3 5. Lambda - event-based actions 6. Kinesis Streams - streaming data 7. EC2 instances - streaming data
79
HIVE Patterns
1. Process and analyse logs 2. Join very large tables 3. Batch jobs 4. Ad-hoc interactive queries
80
Difference between Supervised and Unsupervised Learning
Unsupervised learning:
1. Unlabeled data
2. No knowledge of the output
3. Self-guided learning algorithm
4. Aim: to discover patterns and groupings in the data
Supervised learning:
1. Labeled data
2. Desired outcome is known
3. The algorithm is given training data to learn from
4. Aim: predictive analytics
81
Sqoop
Sqoop is a tool for data migration between Amazon S3, Hadoop, HDFS, and RDBMS databases including Redshift - Parallel data transfer for faster export and ingestion - Batched transfer, not meant for interactive queries
82
Redshift Data Model
1. Star schema - consists of one or more fact tables referencing any number of dimension tables - Fact table - consists of the measurements/metrics of a business process - Dimension table - stores dimensions that describe objects in the fact table
83
EMR - Data at Rest Encryption
1. EC2 cluster nodes: a. open source HDFS encryption b. LUKS encryption 2. For EMRFS on S3: a. SSE-S3 b. SSE-KMS c. CSE-KMS d. CSE-Custom
84
Quicksight Visualizations
- 20 visuals per analysis - QuickSight can determine the most appropriate visual type for you - Dimensions & measures (fields)
85
Lambda Patterns
Lambda Patterns - Real-time file processing - Real-time stream processing - Extract, transform, and load - Replace cron - Process AWS events
86
SQS Features
- 256 KB messages - Messages can be retained for up to 14 days - Two important architectures: 1. SQS priority architecture 2. Fanout architecture
87
EMR Security
Controls: 1. Security groups: a. default & b. EMR-managed 2. IAM roles - default role, EC2 default role & autoscaling default role 3. Private subnet 4. Encryption at rest 5. Encryption in transit
88
EMR Anti-Patterns
Small data sets – Amazon EMR is built for massively parallel processing; if your data set is small enough to run quickly on a single machine, in a single thread, the added overhead of map and reduce jobs may not be worth it. ACID transaction requirements – While there are ways to achieve ACID (atomicity, consistency, isolation, durability) or limited ACID on Hadoop, another database, such as Amazon RDS or a relational database running on Amazon EC2, may be a better option for workloads with stringent requirements.
89
Kinesis Streams Features
Kinesis Streams Features: - Streams receive data from the Producers - Replicate data over multiple availability zones for durability - Distribute data among the provisioned shards
91
EMR File Formats
EMR file formats: 1. Text 2. Parquet 3. ORC 4. Sequence 5. Avro. Keep GZIP files in the 1-2 GB range. Avoid smaller files (<100 MB). S3DistCp can be used to copy data between S3 and HDFS or vice versa
92
IoT Authentication
IoT Authentication: 1. X.509 Certificate 2. Cognito Identity
93
EMR Storage Options
1. Instance store 2. EBS for HDFS 3. EMRFS - S3. EMRFS & HDFS can be used together. Copy data from S3 to HDFS using S3DistCp
94
Data Pipeline Components
1. Data Nodes 2. Activities 3. Preconditions 4. Schedules
95
Redshift Table Design - Compression
1. Automatic - recommended by AWS 2. Manual - use ENCODE to specify column compression
96
Kinesis Streams - Best Practices
- Start off with multiple shards - Have multiple consumers for A/B testing without downtime - Dump data to S3 when possible; it’s cheap and durable - Use the same stream for data archival and analytics - Lambda for transformations and processing - Use logic in consumer if you need only-once delivery; keep state in DynamoDB - Tag streams for cost segregation
97
Redshift Deep Copy
A deep copy recreates and repopulates a table by using a bulk insert, which automatically sorts the table. If a table has a large unsorted region, a deep copy is much faster than a vacuum. The trade-off is that you cannot make concurrent updates during a deep copy operation, which you can do during a vacuum. Options: 1. Perform a deep copy using the original table DDL 2. Perform a deep copy using CREATE TABLE LIKE 3. Perform a deep copy by creating a temporary table and truncating the original table Note: the 1st method is preferred over the others
98
Redshift - Encryption Keys Hierarchy
1. Master Key 2. Cluster Encryption Key 3. Database Encryption Key 4. Data Encryption Key
99
Redshift Data Loading - Manifest
1. Load required files only 2. Load files from different bucket 3. Load files with different prefix 4. JSON format
100
Spark on EMR
1. Spark framework replaces MapReduce framework 2. Spark processing engine will be deployed in each node of cluster 3. Spark SQL can interact with S3 or HDFS
101
Redshift WLM Features
Redshift WLM Features 1. Manages separate queue for long running and short running queries 2. Configure memory allocation to queues 3. Improve performance & expenses
102
Which data Ingestion Tool is similar to Kinesis?
Kafka
103
Kinesis - Producers
- Producers add data records to Kinesis streams - A data record must contain: 1. Name of the stream 2. Partition key 3. Data content - A single data record can be added using the PutRecord API - Multiple data records can be added at one time using the PutRecords API
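A producer using PutRecords typically batches its records first, since PutRecords accepts at most 500 records per request. A minimal, hedged sketch of that batching step (pure Python; the helper name is illustrative and no AWS call is made):

```python
def batch_records(records, max_per_call=500):
    """Split a record list into chunks sized for single PutRecords calls.

    PutRecords accepts up to 500 records per request, so a producer
    iterates over these chunks, issuing one API call per chunk.
    """
    return [records[i:i + max_per_call]
            for i in range(0, len(records), max_per_call)]
```

For example, 1200 records would be sent as three PutRecords calls of 500, 500, and 200 records.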
104
KCL - Features
- Consumes and processes data from an Amazon Kinesis stream
- KCL libraries available for Java, Ruby, Node, Go, and a multi-lang implementation with native Python support
- Creates a DynamoDB table (with the same name as your application) to manage state
- Make sure you don't have name conflicts between an existing DynamoDB table and your app name (same region)
- Multiple KCLs can seamlessly work on the same or different streams
- Checkpoints processed records
- KCL workers can load balance among each other
- Automatically deals with stream scaling, like shard splits and merges
- Emits key performance indicators, like records processed (size, count, latency, age)
- MillisBehindLatest - how far behind the KCL is
105
What is Mahout?
Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most similar items or build item recommendations for users
106
Redshift Important Performance Metrics
Redshift Important Performance Metrics: 1. Number of nodes, processors or slices 2. Node Types 3. Data Distribution 4. Data Sort Order 5. Dataset size 6. Concurrent Operations 7. Query Structure 8. Code Compilation
107
EMR Important Web Interfaces
YARN ResourceManager - http://master-public-dns-name:8088/
YARN NodeManager - http://slave-public-dns-name:8042/
Hadoop HDFS NameNode - http://master-public-dns-name:50070/
Hadoop HDFS DataNode - http://slave-public-dns-name:50075/
Spark HistoryServer - http://master-public-dns-name:18080/
108
What are the differences between LSI and GSI?
Global secondary index (GSI) - an index with a partition key and a sort key that can be different from those on the base table.
- A global secondary index is considered "global" because queries on the index can span all of the data in the base table, across all partitions.
- It can be created any time after table creation
- Does not share RCU & WCU with the table
Local secondary index (LSI) - an index that has the same partition key as the base table, but a different sort key.
- A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a base table partition that has the same partition key value.
- It can only be created at table creation
- Shares RCU & WCU with the table
109
Kinesis Streams - Shard Capacity
Kinesis Streams - Shard Capacity: - 1 MB/sec Data Input - 2 MB/sec Data Output - 5 transactions/sec for read - 1000 records/sec for writes
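The per-shard limits above determine how many shards a stream needs; a minimal sketch, assuming the stream must satisfy whichever limit is tightest (the function name is illustrative):

```python
import math

def shards_needed(ingress_mb_per_sec: float,
                  records_per_sec: float,
                  egress_mb_per_sec: float) -> int:
    """Each shard supports 1 MB/s in, 1000 records/s in, and 2 MB/s out."""
    return max(
        math.ceil(ingress_mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000),
        math.ceil(egress_mb_per_sec / 2.0),
        1,  # a stream always has at least one shard
    )

# e.g. 5 MB/s in, 3000 records/s, 8 MB/s out -> max(5, 3, 4) = 5 shards
```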
110
Which API is used to add: 1) Single Data Records? 2) Multiple Data Records?
1) PutRecord for single data records | 2) PutRecords for multiple data records
111
What is Apache Ranger?
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include a centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. It uses agents to sync policies and users, and plugins that run within the same process as the Hadoop component, for example, NameNode.
112
Redshift unloading data - Encryption Option
1. SSE-S3 2. SSE-KMS 3. CSE-CMK
113
What is Kinesis Analytics?
1. Amazon Kinesis Analytics enables users to run standard SQL queries over live streaming data 2. Readily query Kinesis Streams/Firehose data and export the output to destinations like S3
114
EMR Use Cases
1. Log Processing/Analytics 2. ETL 3. Clickstream Analytics 4. Machine Learning
115
What is Spark RDD?
Resilient Distributed Dataset (RDD) is the core abstraction of Spark - a fault-tolerant collection of elements that can be operated on in parallel
116
Binary Classification Model
- To predict a binary outcome - AUC (Area Under the Curve) measures the prediction accuracy of the model (0 to 1) - Important parameters: histogram, cut-off threshold
117
Kinesis Producer Library (KPL)
- API - Multiple streams - Multithread (for multicore) - Synchronous and asynchronous - Complement to KCL (kinesis client library) - Cloudwatch - records in/out/error
118
Multiclass Classification Model
- To generate predictions for multiple classes - F1 score measures quality of a model (0 to 1) - Confusion Matrix is used
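The F1 score named above is the harmonic mean of precision and recall, computed from confusion-matrix counts; a minimal sketch:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall (0 to 1; higher is better)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 2 false negatives:
# precision = 0.8, recall = 0.8, F1 = 0.8
```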
119
HIVE on EMR Integration
S3 | DynamoDB
120
What is Elasticsearch?
- It is a distributed, multitenant-capable full-text search engine - HTTP web interface - It can be integrated with Logstash & Kibana (the ELK stack): 1. Logstash - data collection & log-parsing engine 2. Kibana - open source data visualization and exploration tool
121
EMR - Long Running vs Transient Cluster
Long Running - 1. Cluster stays up & running for queries against HBASE 2. Jobs on the cluster run frequently Transient Cluster - 1. Temporary cluster that shuts down after processing 2. Good use case is Batch Job
122
Quicksight Visualization Types
1. AutoGraph 2. Bar chart - vertical & horizontal 3. Line charts - gross sales by month - gross sales and net sales by month - measure of a dimension over a period of time 4. Pivot table - a way to summarize data 5. Scatter plot - two or three measures of a dimension 6. Tree map - one to two measures for a dimension 7. Pie chart - compare values for different dimensions 8. Heat map - identify trends & outliers 9. Story - create a narrative by presenting iterations of an analysis 10. Dashboard - read-only snapshot of an analysis
123
Redshift - Cross Region Snapshots
Cross-region KMS-encrypted snapshots are supported for KMS-encrypted clusters 1. The copied snapshot remains encrypted
124
Redshift Table Design - Key Factors
1. Architecture 2. Distribution Styles 3. Sort Keys 4. Compression 5. Constraints 6. Column Sizing 7. Data Types
125
What is HCatalog?
HCatalog is a table storage manager for Hadoop. It can store data in any format and make it available to external systems like Hive and Pig. It can write files in many formats, like RCFile, CSV, JSON, SequenceFile, and ORC, or custom formats
126
What is Redshift Vacuum?
Vacuum helps recover space and sort the table. Vacuum options: FULL, SORT ONLY, DELETE ONLY. Note: when a row is updated or deleted, Redshift does not automatically free up the space
127
What is Pig?
Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data.
128
Elasticsearch Use Cases
1. Logging & analysis 2. Distributed document store 3. Real-time application monitoring 4. Clickstream weblog ingestion
129
Redshift Table - Sort Keys
1. Single 2. Compound 3. Interleaved
130
What is Zeppelin?
- Zeppelin is a web-based notebook that enables interactive data analytics
- Ingestion, discovery, analytics, visualization, and collaboration
- Connectors for HDFS, HBase, Hive, Spark, Flink, PostgreSQL, Redshift, and Elasticsearch