AWS Big Data Specialty Flashcards
Spark Patterns and Anti Patterns
Spark Patterns:
- High performance fast engine for processing large amounts of data (In-memory, Disk)
- Faster than running queries in Hive
- Run queries against live data
- Flexibility in terms of languages
Spark Anti Patterns:
- It is not designed for OLTP
- Not fit for batch processing
- Avoid large multi-user reporting environment with high concurrency
Kinesis Retention Periods
24 Hours to 7 Days
Default is 24 Hours
EMR Consistent View
EMRFS consistent view monitors Amazon S3 list consistency for objects written by or synced with EMRFS, delete consistency for objects deleted by EMRFS, and read-after-write consistency for new objects written by EMRFS.
You can configure additional settings for consistent view by providing them in the /home/hadoop/conf/emrfs-site.xml file.
DynamoDB Max number of LSI
5
Kinesis Firehose Handling
- S3: retries delivery for up to 24 hours
- Redshift and Elasticsearch: retry duration is configurable from 0 to 7200 seconds
Apache Hadoop Modules
- Hadoop Common
- HDFS
- YARN
- MapReduce
Impala
Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS).
Kinesis Consumers
Read data from streams:
- for further processing
- data store delivery
Kinesis Streams
Kinesis Streams:
- Receive data from the Producers
- Replicate data over multiple availability zones for durability
- Distribute data among the provisioned shards
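A minimal sketch of how a record can be routed to a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and picks the shard whose hash key range contains it. The even split of the hash space and the function name `shard_for_key` are assumptions made for illustration.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the stream has shard_count shards that split the
    hash space into equal, contiguous ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so keys in the leftover tail of the space land in the last shard.
    return min(hash_key // range_size, shard_count - 1)


# Hypothetical producer keys spread over 4 shards.
counts = [0, 0, 0, 0]
for device in range(1000):
    counts[shard_for_key(f"device-{device}", 4)] += 1
```

The same partition key always maps to the same shard, which is why a single hot key can overload one shard while others stay idle.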
EMR Data Compression Formats
Algorithm/Splittable/Compression Ratio/Compression-Decompression Speed
- GZIP/No/High/Medium
- bzip2/Yes/Very High/Slow
- LZO/Yes/Low/Fast
- Snappy/No/Low/Very Fast
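The ratio column can be checked by hand for the two codecs in Python's standard library. This is only a sketch: the sample data is made up, results depend heavily on the input, and LZO and Snappy are left out because they need third-party packages.

```python
import bz2
import gzip


def compression_ratios(data: bytes) -> dict:
    """Return original size divided by compressed size for gzip and bzip2."""
    return {
        "gzip": len(data) / len(gzip.compress(data)),
        "bzip2": len(data) / len(bz2.compress(data)),
    }


# Hypothetical sample: repetitive log lines, which compress well.
sample = b"2024-01-01 INFO request served in 12 ms\n" * 10_000
ratios = compression_ratios(sample)
print(ratios)
```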
Presto - Patterns and Anti-Patterns
Presto Patterns:
- Query different types of data sources: relational databases, NoSQL stores, Hive, Kafka streams
- High concurrency
- In-memory processing
Presto Anti-patterns:
- Not fit for Batch Processing
- Not designed for OLTP
- Not fit for large join operations
KPL - Key Concepts
- Include library and use
- Can write to multiple Amazon Kinesis streams
- Error recovery built-in: Retry mechanisms
- Synchronous and asynchronous writing
- Multithreading
- Complement to the Amazon Kinesis Client Library (KCL)
- CloudWatch integration: records in/out/error
- Batches data records to increase payload size and improve throughput
- Aggregation – multiple data records sent in one transaction, increasing the number of records sent per API call
- Collection – takes multiple aggregated records from the previous step and sends them as one HTTP request; further optimizing the data transfer by reducing HTTP request overhead
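The two batching steps can be pictured with a small simulation. This is not the KPL's implementation; the function names, record sizes, and limits are all hypothetical and only illustrate the idea of packing first and grouping second.

```python
from typing import List


def aggregate(records: List[bytes], max_bytes: int) -> List[List[bytes]]:
    """Pack user records into aggregated records of roughly max_bytes.

    A single record larger than max_bytes gets an aggregated record of its own.
    """
    aggregated: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    for record in records:
        if current and size + len(record) > max_bytes:
            aggregated.append(current)
            current, size = [], 0
        current.append(record)
        size += len(record)
    if current:
        aggregated.append(current)
    return aggregated


def collect(aggregated: List[List[bytes]], max_per_request: int) -> List[List[List[bytes]]]:
    """Group aggregated records into batches, one batch per HTTP request."""
    return [
        aggregated[i:i + max_per_request]
        for i in range(0, len(aggregated), max_per_request)
    ]


# 1,000 user records of 100 bytes each (hypothetical sizes and limits).
user_records = [b"x" * 100 for _ in range(1000)]
aggregated_records = aggregate(user_records, max_bytes=25_000)
requests = collect(aggregated_records, max_per_request=2)
print(len(user_records), len(aggregated_records), len(requests))
```

With these numbers, 1,000 user records become 4 aggregated records and then 2 requests, which is the reduction in per-call overhead that both steps aim for.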
Resizing EMR Cluster
- Only task nodes can be resized up or down
- Only one master, cannot change that
- Core nodes can only be added
- Even with EMRFS, core nodes have HDFS for processing
- Add task nodes, task node groups when more processing is needed
Redshift Important Operations
Redshift important operations:
- Launch
- Resize
- Vacuum
- Backup & Restore
- Monitoring
DynamoDB Performance Metrics
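1 partition = up to 10 GB of data, 3000 RCU, and 1000 WCU
1 RCU = one strongly consistent read per second for an item up to 4 KB
1 WCU = one write per second for an item up to 1 KB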
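A worked sketch of turning item size and request rate into capacity units, using the 4 KB read and 1 KB write unit sizes above. The function names are hypothetical. The half-cost rule for eventually consistent reads is not in these notes; it comes from DynamoDB's capacity model.

```python
import math


def read_capacity_units(item_size_kb: float, reads_per_second: int,
                        strongly_consistent: bool = True) -> int:
    """RCUs needed: one unit covers a strongly consistent read of up to 4 KB."""
    units_per_read = math.ceil(item_size_kb / 4)
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_kb: float, writes_per_second: int) -> int:
    """WCUs needed: one unit covers a write of up to 1 KB."""
    return math.ceil(item_size_kb) * writes_per_second


print(read_capacity_units(6, 100))          # 6 KB items, 100 reads/s
print(read_capacity_units(6, 100, False))   # same load, eventually consistent
print(write_capacity_units(1.5, 50))        # 1.5 KB items, 50 writes/s
```

Item sizes round up to the next unit boundary, so a 6 KB item costs 2 RCU per read and a 1.5 KB item costs 2 WCU per write.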
DynamoDB Streams Configuration Views
- KEYS_ONLY
- NEW_IMAGE
- OLD_IMAGE
- NEW_AND_OLD_IMAGES
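A small sketch of what each view type puts into a stream record. It only models the item data; the function name and sample item are hypothetical, and real stream records carry extra metadata that is omitted here.

```python
from typing import Any, Dict, Optional

Item = Dict[str, Any]
VIEW_TYPES = ("KEYS_ONLY", "NEW_IMAGE", "OLD_IMAGE", "NEW_AND_OLD_IMAGES")


def stream_record(view_type: str, keys: Item,
                  old_item: Optional[Item], new_item: Optional[Item]) -> Dict[str, Item]:
    """Return the item data a stream record carries for the given view type."""
    if view_type not in VIEW_TYPES:
        raise ValueError(f"unknown view type: {view_type}")
    record: Dict[str, Item] = {"Keys": keys}  # key attributes are always present
    if view_type in ("NEW_IMAGE", "NEW_AND_OLD_IMAGES") and new_item is not None:
        record["NewImage"] = new_item  # item as it appears after the change
    if view_type in ("OLD_IMAGE", "NEW_AND_OLD_IMAGES") and old_item is not None:
        record["OldImage"] = old_item  # item as it appeared before the change
    return record


# Hypothetical update of one item.
keys = {"order_id": "o-1"}
before = {"order_id": "o-1", "status": "NEW"}
after = {"order_id": "o-1", "status": "SHIPPED"}
for view in VIEW_TYPES:
    print(view, sorted(stream_record(view, keys, before, after)))
```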
KPL Use Cases
- High rate producers
- Record aggregation
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- Coordinates distributed processing
Regression Model
- To predict a numerical value
- RMSE (root-mean-square error) measures the quality of a model
- Lower RMSE means better predictions
Use Cases
- What is your house worth?
- How many units of a product will sell?
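A short sketch of computing RMSE for the regression card above: square each prediction error, average the squares, take the square root. The house prices and both sets of predictions are made-up numbers.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical house prices in thousands, with predictions from two models.
actual_prices = [200.0, 250.0, 300.0]
model_a = [210.0, 240.0, 310.0]   # off by 10 each time
model_b = [230.0, 220.0, 330.0]   # off by 30 each time
print(rmse(actual_prices, model_a), rmse(actual_prices, model_b))
```

Model A has the lower RMSE, so by this measure it predicts better.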
Kinesis Agent
- Real-time Kinesis file mediation client written in Java
- Streams files/tails files
- Handles file rotation, checkpointing, and retry upon failure
- Multiple folders/files to multiple streams
- Transform data prior to streaming: SINGLELINE, CSVTOJSON, LOGTOJSON
- CloudWatch metrics: BytesSent, RecordSendAttempts, RecordSendErrors, ServiceErrors
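A sketch of an agent configuration that maps two file patterns to two streams and applies a transform to each. The key names follow the agent's JSON configuration file as best I recall and should be checked against the agent documentation. The file paths, stream names, and field names are placeholders.

```python
import json

# All paths, stream names, and field names below are placeholders.
agent_config = {
    "cloudwatch.emitMetrics": True,  # publish the agent's CloudWatch metrics
    "flows": [
        {
            "filePattern": "/var/log/app/orders*.csv",
            "kinesisStream": "orders-stream",
            "dataProcessingOptions": [
                # Convert each delimited line to JSON with these field names.
                {"optionName": "CSVTOJSON", "customFieldNames": ["order_id", "amount"]}
            ],
        },
        {
            "filePattern": "/var/log/app/access*.log",
            "kinesisStream": "access-stream",
            # Collapse multi-line records into a single line.
            "dataProcessingOptions": [{"optionName": "SINGLELINE"}],
        },
    ],
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```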
Kinesis Firehose Destination Data Delivery
- S3
- Elasticsearch
- Redshift
Machine Learning Algorithms
- Supervised Learning - Trained
a. Classification - Is this transaction fraud?
b. Regression - Customer lifetime value
- Unsupervised Learning - Self Learning
a. Clustering - Market Segmentation
EMR Cluster sizing
- Master node: m3.xlarge for fewer than 50 nodes, m3.2xlarge for more than 50 nodes
- Core nodes, HDFS replication factor by cluster size:
  - 10 or more nodes: 3
  - 4-9 nodes: 2
  - 3 or fewer nodes: 1
HDFS capacity formula: Data Size = Total Storage / Replication Factor
Note: AWS recommends a smaller cluster of larger nodes
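A worked sketch of the HDFS capacity formula. The function names and node storage sizes are hypothetical, and treating a cluster of exactly 10 core nodes as replication factor 3 is an assumption about where the boundary falls.

```python
def replication_factor(core_nodes: int) -> int:
    """Replication factor by core node count (assumed boundary: 10 nodes -> 3)."""
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1


def hdfs_data_capacity_gb(core_nodes: int, storage_per_node_gb: float) -> float:
    """Data size that fits in HDFS: total storage divided by replication factor."""
    total_storage_gb = core_nodes * storage_per_node_gb
    return total_storage_gb / replication_factor(core_nodes)


# Hypothetical clusters with 1,000 GB of HDFS storage per core node.
print(hdfs_data_capacity_gb(12, 1000))  # 12,000 GB raw, replicated 3 times
print(hdfs_data_capacity_gb(6, 1000))   # 6,000 GB raw, replicated 2 times
```

Doubling the node count from 6 to 12 only raises usable capacity from 3,000 GB to 4,000 GB here, because the larger cluster also replicates each block three times.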
DynamoDB Performance
- Partitions by throughput = Desired RCU / 3000 + Desired WCU / 1000
- Partitions by size = Data size in GB / 10 GB
- Total partitions = the larger of the two results, rounded up
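A worked sketch of the two partition estimates. Rounding each result up and taking the larger of the two is an assumption about how the formulas combine. The function names and workload numbers are hypothetical.

```python
import math


def partitions_for_throughput(rcu: int, wcu: int) -> int:
    """Partitions needed for provisioned throughput (3000 RCU, 1000 WCU each)."""
    return math.ceil(rcu / 3000 + wcu / 1000)


def partitions_for_size(size_gb: float) -> int:
    """Partitions needed for table size (10 GB each)."""
    return math.ceil(size_gb / 10)


def estimated_partitions(rcu: int, wcu: int, size_gb: float) -> int:
    """Assumed combination: the larger of the two estimates."""
    return max(partitions_for_throughput(rcu, wcu), partitions_for_size(size_gb))


# Hypothetical table: 7,500 RCU, 3,000 WCU, 45 GB of data.
print(partitions_for_throughput(7500, 3000))  # 2.5 + 3.0 = 5.5, rounded up
print(partitions_for_size(45))                # 4.5, rounded up
print(estimated_partitions(7500, 3000, 45))
```

Provisioned throughput is divided across partitions, so a table with many partitions leaves each one with a small share of the total RCU and WCU.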