Processing Flashcards

1
Q

Lambda

A
  • Run code snippets in the cloud
  • Serverless
  • Continuous Scaling
  • Support multiple programming languages
2
Q

Lambda Use Cases

A
  • Real-time file processing
  • Real-time stream processing
  • ETL
  • Cron replacement
3
Q

Lambda Supported Languages

A
  • Node.js
  • Python
  • Java
  • Go
  • Ruby
4
Q

Lambda Triggers

A
  • S3
  • Kinesis Data Streams (KDS)
  • SNS
  • SQS
  • AWS CloudWatch
  • AWS CloudFormation
5
Q

Lambda Redshift

A
  • Best practice for loading data into Redshift is the COPY command
  • Use DynamoDB to keep track of what has been loaded
  • Lambda can batch up new data and load it with COPY (see the sketch below)
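
A minimal sketch of this pattern, using the boto3 Redshift Data API so the function needs no bundled database driver; the cluster, database, user, table, role ARN, and S3 path are all hypothetical:

    import boto3

    # Hypothetical Lambda handler: issue a COPY for a newly batched S3 prefix.
    client = boto3.client("redshift-data")

    def handler(event, context):
        client.execute_statement(
            ClusterIdentifier="my-redshift-cluster",
            Database="analytics",
            DbUser="loader",
            Sql="""
                COPY public.events
                FROM 's3://my-bucket/staging/batch-0001/'
                IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
                FORMAT AS JSON 'auto';
            """,
        )
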
6
Q

Lambda + Kinesis

A
  • Lambda receives an event with a batch of stream records
    • Specify a batch size (up to 10,000 records; see the sketch below)
    • Batches are split if they exceed Lambda’s payload limit (6 MB)
  • Lambda will retry the batch until it succeeds or the data expires
    • This can stall the shard if you do not handle errors properly
    • Use more shards to ensure processing isn’t totally held up by errors
  • Lambda processes shard data synchronously
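
A sketch of wiring a Kinesis stream to a function with an explicit batch size via boto3; the stream ARN, function name, and retry limit are placeholders:

    import boto3

    lambda_client = boto3.client("lambda")

    # Map a Kinesis stream to a Lambda function with an explicit batch size.
    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
        FunctionName="process-stream-records",
        BatchSize=10000,             # up to 10,000 records per invocation
        StartingPosition="LATEST",   # or TRIM_HORIZON to start from the oldest record
        MaximumRetryAttempts=3,      # bound retries so one bad batch cannot stall the shard
    )
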
7
Q

Lambda Cost Model

A
  • Generous free tier (1M requests/month, 400K GB-seconds of compute time)
  • $0.20 per million requests
  • $0.00001667 per GB-second (worked example below)
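
A worked example under an assumed workload (5M invocations/month, 128 MB memory, 200 ms average duration), applying the rates and free tier above:

    # Hypothetical workload: 5M invocations/month, 128 MB, 200 ms average.
    requests = 5_000_000
    gb_seconds = requests * 0.200 * (128 / 1024)        # 125,000 GB-seconds

    # Subtract the free tier (1M requests, 400K GB-seconds), then apply the rates.
    request_cost = max(requests - 1_000_000, 0) / 1_000_000 * 0.20   # $0.80
    compute_cost = max(gb_seconds - 400_000, 0) * 0.00001667         # $0.00 (within free tier)
    print(f"${request_cost + compute_cost:.2f}/month")               # ~ $0.80
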
8
Q

Lambda Promises

A
  • High availability
    • No scheduled downtime
    • Retries failed code 3 times
  • Unlimited scalability
    • Safety throttle of 1,000 concurrent executions per region
  • High Performance
    • New functions callable in seconds
    • Code is cached automatically
    • Maximum timeout is 15 minutes
9
Q

Lambda Anti-Patterns

A
  • Long-running applications
    • Use EC2 instead or chain functions
  • Dynamic Websites
  • Stateful applications
10
Q

AWS Glue

A
  • Discovery of table schema
    • S3
    • RDS
    • Redshift
    • DynamoDB
  • Fully managed
  • Serverless
  • Event-triggered or on a schedule
  • Use Cases
    • Data Discovery (schema inference)
    • Data Catalog
    • Data Transformation (AWS Glue Studio)
    • Data Replication (AWS Glue Elastic Views)
    • Data Preparation (AWS Glue DataBrew)
11
Q

Glue Crawler / Data Catalog

A
  • Glue crawler scans data in S3 and creates a schema (see the sketch below)
  • Run periodically
  • Populates the Glue Data Catalog
    • Stores only table definition
    • Original data stays in S3
  • Once cataloged, you can treat your unstructured data like it is structured
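
A sketch of creating and scheduling a crawler over an S3 path with boto3; the crawler name, role, database, bucket, and schedule are hypothetical:

    import boto3

    glue = boto3.client("glue")

    # Crawl an S3 prefix nightly and register the inferred schema in the Data Catalog.
    glue.create_crawler(
        Name="sales-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
        Schedule="cron(0 2 * * ? *)",   # run periodically; the data itself stays in S3
    )
    glue.start_crawler(Name="sales-crawler")
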
12
Q

Glue and S3 Partition

A
  • Glue crawler will extract partitions based on how your S3 data is organized
  • Organize your S3 paths to match your query patterns (see the sketch below), e.g.
    • yyyy/mm/dd/device_id
    • device_id/yyyy/mm/dd
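
For example, assuming Hive-style key=value prefixes (year=2023/month=06/...) so the crawler can name the partition columns, a Glue job can then prune partitions at read time with a pushdown predicate; database and table names are hypothetical:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Only S3 objects under matching partition prefixes are read at all.
    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="iot_db",
        table_name="readings",
        push_down_predicate="year = '2023' AND month = '06'",
    )
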
13
Q

Glue and Hive

A
  • Hive lets you run SQL-like queries from EMR
  • The Glue Data Catalog can serve as a Hive “metastore”
  • We can import a Hive metastore into Glue
14
Q

Glue ETL

A
  • Transform, clean, and enrich data before analysis
    • Generates ETL code (which we can modify)
    • We can provide our own Spark or PySpark scripts
    • Target can be S3, JDBC, or the Glue Data Catalog
  • Fully managed
  • Scala or Python
  • Can be event-driven or scheduled
  • Can provision additional DPUs to increase performance of the underlying Spark jobs
  • Batch-oriented; you can schedule ETL jobs at a minimum of 5-minute intervals (see the job sketch below)
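
A minimal sketch of the shape of a Glue ETL job in PySpark; the catalog names, field mappings, and S3 target are hypothetical:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read from the Data Catalog, rename/retype fields, write Parquet to S3.
    src = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")
    mapped = ApplyMapping.apply(frame=src, mappings=[
        ("order id", "string", "order_id", "string"),
        ("amt", "string", "amount", "double"),
    ])
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/clean/orders/"},
        format="parquet",
    )
    job.commit()
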
15
Q

Glue DynamicFrame

A
  • DynamicFrame is a collection of DynamicRecords
    • DynamicRecords are self-describing and have a schema
    • Similar to a Spark DataFrame (see the conversion sketch below)
    • Scala and Python APIs
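
A short sketch of converting between the two representations; it assumes an existing GlueContext (glueContext) and DynamicFrame (dyf):

    from awsglue.dynamicframe import DynamicFrame

    # DynamicFrame -> Spark DataFrame, to use plain Spark APIs...
    df = dyf.toDF()
    df = df.dropDuplicates()

    # ...and back to a DynamicFrame for Glue transforms and sinks.
    dyf_clean = DynamicFrame.fromDF(df, glueContext, "dyf_clean")
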
16
Q

Glue ETL Transformations

A
  • Bundled Transformations (see the sketch below)
    • DropFields, DropNullFields (remove null fields)
    • Filter, Join, Map
  • Machine Learning Transformations
    • FindMatches ML : identify duplicate or matching records in your dataset
  • Format Conversions : CSV, JSON, Avro, Parquet, ORC, XML
  • Apache Spark transformations (example : K-Means)
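
A sketch of two of the bundled transforms; the field name is hypothetical and an existing DynamicFrame (dyf) is assumed:

    from awsglue.transforms import DropNullFields, Filter

    # Remove fields that are null in every record.
    dyf = DropNullFields.apply(frame=dyf)

    # Keep only records matching a predicate.
    dyf = Filter.apply(frame=dyf, f=lambda row: row["amount"] is not None and row["amount"] > 0)
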
17
Q

Glue ETL : Modifying the Data Catalog

A
  • ETL scripts can update your schema and partitions if necessary
  • Updating table schema (when there is a new partition)
    • Re-run the crawler or
    • Use enableUpdateCatalog / updateBehavior from the script (see the sketch below)
  • Restrictions
    • S3 only
    • JSON, CSV, Avro, Parquet only
    • Parquet requires special code
    • Nested schemas are not supported
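
A sketch of the enableUpdateCatalog path through a Glue sink, following the pattern in the Glue documentation; the database, table, path, partition keys, and the DynamicFrame (dyf) are assumptions:

    # Write a DynamicFrame and update the catalog table's schema/partitions in one step.
    sink = glueContext.getSink(
        connection_type="s3",
        path="s3://my-bucket/clean/orders/",
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=["year", "month"],
    )
    sink.setFormat("json")   # JSON/CSV/Avro/Parquet only; Parquet needs special handling
    sink.setCatalogInfo(catalogDatabase="sales_db", catalogTableName="orders")
    sink.writeFrame(dyf)
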
18
Q

Glue Development Endpoints

A
  • Develop ETL scripts using a Notebook
  • Endpoint is in a VPC controlled by security groups; connect via
    • Apache Zeppelin notebook, SageMaker notebook, or a terminal window
19
Q

Running Glue Jobs

A
  • Time-based schedules (cron-style)
  • Job bookmarks (see the sketch below)
    • Persists state from the job run
    • Prevents reprocessing of old data
    • Allows you to process new data only when re-running on a schedule
    • Works with S3 sources in a variety of formats
    • Works with relational databases via JDBC
  • CloudWatch Events
    • Fire off a Lambda function or SNS notification when ETL succeeds or fails
    • Invoke an EC2 action, send the event to Kinesis, or activate a Step Function
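
A sketch of starting a job with bookmarks enabled via boto3; the job name is hypothetical, and the bookmark flag is passed as a standard job argument:

    import boto3

    glue = boto3.client("glue")

    # Re-run the job on a schedule; the bookmark makes it process only new data.
    glue.start_job_run(
        JobName="nightly-orders-etl",
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
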
20
Q

Glue Cost Model

A
  • Billed by the second for crawler and ETL jobs
  • First million objects stored and accessed are free for the Glue Data Catalog
  • Development endpoints for developing ETL code charged by the minute
21
Q

Glue Anti-Patterns

A
  • Workloads that require a non-Spark engine (Glue ETL runs on Apache Spark)
22
Q

AWS Glue Studio

A
  • Visual interface for ETL workflows
  • Visual Job Editor
    • Create DAGs for complex workflows
    • Sources include S3, Kinesis, Kafka, JDBC
    • Transform / sample / join data
    • Target to S3 or the Glue Data Catalog
    • Supports partitioning
  • Visual Job Dashboard
23
Q

AWS Glue DataBrew

A
  • Visual data preparation tool
    • UI for preprocessing large data sets
    • Input from S3, data warehouse, or database
    • Output to S3
  • Over 250 ready-made transformations
  • Create “recipes” of transformations that can be saved as jobs within a larger project
  • Define data quality rules
  • Create datasets with custom SQL from Redshift and Snowflake
  • Security
    • Can integrate with KMS
    • SSL in transit
    • IAM
24
Q

AWS Glue Elastic Views

A
  • Builds materialized views from Aurora, RDS, DynamoDB
  • Those views can be used by Redshift, ElasticSearch, S3, DynamoDB, Aurora, RDS
  • SQL Interface
  • Handles copying, combining, and replicating data
  • Monitors for changes and continuously updates
  • Serverless
25
Q

AWS Lake Formation

A
  • “Makes it easy to set up a secure data lake in days”
  • Loading data and monitoring data flows
  • Setting up partitions
  • Encryption and managing keys
  • Defining transformation jobs and monitoring them
  • Built on top of Glue
  • Auditing
26
Q

AWS Lake Formation Pricing

A
  • No cost for Lake Formation itself
  • But the underlying services incur charges
    • Glue
    • S3
    • EMR
    • Athena
    • Redshift
27
Q

AWS Lake Formation : The Finer Points

A
  • Cross-account Lake Formation permissions
  • Lake Formation does not support manifests in Athena or Redshift queries
  • IAM is needed to create blueprints and workflows
28
Q

AWS Lake Formation : Governed Tables and Security

A
  • Now supports “Governed Tables” that support ACID transactions across multiple tables
  • Storage optimization with automatic compaction
  • Granular access control with row- and cell-level security
29
Q

Elastic MapReduce

A
  • Managed Hadoop framework on EC2 instances
  • Includes Spark, HBase, Presto, Flink, Hive, etc.
  • EMR Notebooks
  • Several integration points with AWS
30
Q

EMR Cluster

A
  • Master Node : manages the cluster
    • Tracks status of tasks, monitors cluster health
    • Single EC2 instance
  • Core Node : hosts HDFS data and runs tasks
    • Can be scaled up and down
    • Multi-node clusters have at least one
  • Task Node : runs tasks and does not host data
    • Optional, and no risk of data loss when removing
    • Good use of spot instances (see the sketch below)
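
A sketch of the three node roles in a boto3 run_job_flow call; the cluster name, release label, instance types, and counts are hypothetical:

    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="analytics-cluster",
        ReleaseLabel="emr-6.9.0",
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so spot capacity is safe here.
                {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": 2,
                 "Market": "SPOT"},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,   # False would make the cluster transient
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
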
31
Q

EMR Usage

A
  • Transient vs long-running clusters
    • Transient clusters terminate once all steps are completed
    • Long-running clusters must be manually terminated
      • Basically a data warehouse with periodic processing on large datasets
    • Can spin up task nodes using Spot instances for temporary capacity
    • Can use reserved instances on long-running clusters to save money
    • Termination protection is enabled by default
  • Frameworks and applications are specified at cluster launch
  • Connect directly to the master node to run jobs
  • Submit ordered steps via the console
32
Q

EMR and AWS Integration

A
  • EC2 for the instances that comprise the nodes in the cluster
  • Amazon VPC to configure the virtual network
  • S3 as input and output data
  • CloudWatch to monitor cluster performance and configure alarms
  • IAM for permissions
  • CloudTrail to audit requests made to the service
  • AWS Data Pipeline to schedule and start your clusters
33
Q

EMR Storage

A
  • HDFS
    • Multiple copies stored across cluster instances for redundancy
    • Files stored as blocks (128 MB default size)
    • Ephemeral : HDFS data is lost when the cluster is terminated
    • Useful for caching intermediate results or workloads
  • EMRFS : access S3 as if it were HDFS
    • Allows persistent storage after cluster termination
    • EMRFS Consistent View
      • Uses DynamoDB to track consistency
      • May need to provision read/write capacity for DynamoDB
  • Local file system
    • Suitable only for temporary data (buffers, caches, etc.)
  • EBS for HDFS
    • Allows use of EMR on EBS-only instance types (M4, C4)
    • Deleted when the cluster is terminated
    • EBS volumes can only be attached when launching a cluster
    • If you manually detach an EBS volume, EMR treats that as a failure and replaces it
34
Q

EMR Promises

A
  • Charged by the hour, plus EC2 charges
  • Provisions new nodes if a core node fails
  • Can add and remove task nodes on the fly
  • Can resize a running cluster's core nodes
  • Core nodes can also be added or removed
35
Q

EMR Managed Scaling

A
  • EMR Automatic Scaling
    • Custom scaling rules based on CloudWatch metrics
    • Supports instance groups only
  • EMR Managed Scaling (see the sketch below)
    • Supports instance groups and instance fleets
    • Scales spot, on-demand, and Savings Plan instances within the same cluster
    • Available for Spark, Hive, and YARN workloads
  • Scale-up strategy
    • First adds core nodes, then task nodes, up to the maximum units specified
  • Scale-down strategy
    • First removes task nodes, then core nodes, no further than the minimum constraints
    • Spot nodes are always removed before on-demand instances
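
A sketch of attaching a managed scaling policy with boto3; the cluster ID and capacity limits are placeholders:

    import boto3

    emr = boto3.client("emr")

    # Let EMR scale the cluster between the given instance limits.
    emr.put_managed_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",                 # placeholder cluster ID
        ManagedScalingPolicy={
            "ComputeLimits": {
                "UnitType": "Instances",
                "MinimumCapacityUnits": 2,
                "MaximumCapacityUnits": 10,
                "MaximumOnDemandCapacityUnits": 4,   # the remainder may be spot
                "MaximumCoreCapacityUnits": 3,       # scale task nodes beyond this
            }
        },
    )
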
36
Q

EMR Pre-Initialized Capacity

A
  • Spark adds 10% overhead to the memory requested for drivers and executors
  • Be sure initial capacity is at least 10% more than the job requests
37
Q

EMR Serverless Security

A
  • EMRFS : S3 encryption at rest
  • TLS encryption in transit
  • Local disk encryption
  • Spark communication between drivers and executors is encrypted
38
Q

Hadoop

A
  • MapReduce
    • Framework for distributed data processing
    • Maps data to key/value pairs
    • Reduces intermediate results to final output
  • Yet Another Resource Negotiator (YARN)
    • Manages cluster resources for multiple data processing frameworks
  • HDFS
    • Distributes data blocks across the cluster in a redundant manner
    • Ephemeral in EMR : data is lost on termination
39
Q

Apache Spark

A
  • Distributed processing framework for big data
  • In-memory caching, optimized query execution
  • Supports Java, R, Python, etc.
  • Supports code reuse across
    • Batch processing
    • Interactive queries
    • Real-time analytics
    • Machine learning
    • Graph processing
  • Spark Streaming
    • Integrated with Kinesis and Kafka on EMR
  • Spark is NOT meant for OLTP
40
Q

How Spark Works

A
  • The SparkContext (driver program) coordinates Spark applications
  • The SparkContext works through a Cluster Manager
  • Executors run computations and store data
  • The SparkContext sends application code and tasks to executors (see the sketch below)
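
A minimal PySpark sketch that makes the pieces visible; the S3 paths are placeholders. The driver's SparkContext builds the plan and ships the functions to executors, which run them against data partitions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count").getOrCreate()
    sc = spark.sparkContext   # the driver program's SparkContext

    # Executors run these functions on partitions of the data.
    counts = (sc.textFile("s3://my-bucket/input/")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("s3://my-bucket/output/")
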
41
Q

Spark Components

A
  • Spark Streaming
    • Real-time streaming analytics and structured streaming
  • Spark SQL
    • Up to 100x faster than MapReduce
    • JDBC, ODBC, JSON, HDFS, ORC, Parquet, HiveQL
  • MLLib
    • Classification, regression, clustering, collaborative filtering, pattern mining
    • Reads from HDFS, HBase
  • GraphX
    • Graph processing, ETL, analysis, iterative graph computation
    • No longer widely used
  • Spark Core
    • Memory management, fault recovery, scheduling, distributing and monitoring jobs, interacting with storage
    • Scala, Python, Java, R
42
Q

Spark Structured Streaming

A
  • Models a data stream as an unbounded input table (see the sketch below)
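
A sketch of the unbounded-table model using Spark's built-in rate source, so it runs with no external dependencies:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("structured-streaming-demo").getOrCreate()

    # The stream is treated as an ever-growing input table...
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # ...and queries over it are updated incrementally as rows arrive.
    counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
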
43
Q

Hive

A
  • Lets users run SQL-like queries on EMR
  • Scalable
  • Easy OLAP queries
  • Highly optimized
  • Highly extensible
44
Q

Hive Metastore

A
  • Hive maintains a “metastore” that imparts a structure you define on the unstructured data stored in HDFS
45
Q

External Hive Metastores

A
  • The metastore is stored in MySQL on the master node by default
  • External metastores offer better resiliency and integration
    • AWS Glue Data Catalog
      • Shares schemas across EMR and other AWS services
      • Tie Glue to EMR using the console, CLI, or API (see the configuration sketch below)
    • Amazon RDS / Aurora
      • Need to override default Hive configuration values for the external database location
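
A sketch of the hive-site override that points EMR's Hive at the Glue Data Catalog; it would be passed as the Configurations parameter of run_job_flow (as in the cluster sketch above) at launch:

    # Use the Glue Data Catalog as Hive's metastore.
    configurations = [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ]
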
46
Q

Apache Pig

A
  • Pig introduces a scripting language that lets you use SQL-like syntax to define your map and reduce steps
  • Highly extensible with user-defined functions (UDFs)
47
Q

Apache Pig AWS Integration

A
  • Use multiple file systems (HDFS, S3, etc.)
  • Load JARs and scripts from S3
48
Q

HBase

A
  • Non-relational, petabyte-scale database
  • Based on Google's BigTable, built on top of HDFS
  • In-memory
  • Hive integration
49
Q

HBase and AWS Integration

A
  • Can store data (StoreFiles and metadata) on S3 via EMRFS
  • Can back up to S3
50
Q

Presto

A
  • Can connect to many different “big data” databases and data stores at once and query across them
  • Interactive queries at petabyte scale
  • Familiar SQL syntax
  • Optimized for OLAP
  • Developed and partially maintained by Facebook
  • This is what Amazon Athena uses under the hood
  • Exposes JDBC, command-line, and Tableau interfaces
51
Q

Apache Zeppelin

A
  • iPython-style notebook environment
  • Can share notebooks with others on your cluster
  • Spark, Python, JDBC, HBase, ElasticSearch
52
Q

Apache Zeppelin + Spark

A
  • Run Spark code interactively
    • Speeds up your development cycle
    • Allows easy experimentation and exploration of the data
  • Can execute SQL queries directly against Spark SQL
  • Query results may be visualized in charts and graphs
  • Makes Spark feel more like a data science tool
53
Q

HBase vs DynamoDB

A
  • DynamoDB
    • Fully managed
    • More integration with other AWS services
    • Glue integration
  • HBase
    • Efficient storage of sparse data
    • Appropriate for high-frequency counters (consistent reads and writes)
    • High write and update throughput
    • More integration with Hadoop
54
Q

EMR Notebook

A
  • Similar to Zeppelin, with more AWS integrations
  • Notebooks are backed up to S3
  • Provision clusters from the notebook
  • Hosted inside a VPC
  • Accessed only via the AWS console
55
Q

Hue

A
  • Hadoop User Experience
  • Graphical front-end for applications on your EMR cluster
  • IAM integration : Hue super-users inherit IAM roles
56
Q

Splunk

A
  • Splunk makes machine data accessible, usable, and valuable to everyone
  • Operational tool : can be used to visualize EMR and S3 data using your EMR Hadoop cluster
  • Reserved instances on a 64-bit OS are recommended
57
Q

Flume

A
  • Another way to stream data into your cluster
  • Made from the start with Hadoop in mind
  • Originally made to handle log aggregation
58
Q

MXNet

A
  • Like TensorFlow, a library for building and accelerating neural networks
  • Included on EMR
59
Q

S3DistCp

A
  • Tool for copying large amounts of data (see the step sketch below)
    • From S3 into HDFS, or from HDFS into S3
  • Uses MapReduce to copy in a distributed manner
  • Suitable for parallel copying of large numbers of objects
    • Across buckets, across accounts
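
A sketch of submitting S3DistCp as an EMR step via boto3; the cluster ID and paths are placeholders:

    import boto3

    emr = boto3.client("emr")

    # Run s3-dist-cp on the cluster to copy HDFS output to S3 in parallel.
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",   # placeholder cluster ID
        Steps=[{
            "Name": "hdfs-to-s3-copy",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "hdfs:///output/",
                         "--dest", "s3://my-bucket/output/"],
            },
        }],
    )
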
60
Q

EMR Security

A
  • IAM
  • Kerberos
    • Secure user authentication
  • SSH
    • Secure connection to the command line
    • Tunneling for web interfaces
  • Block public access
    • Easy way to prevent public access to data stored on your EMR cluster
    • Can be set at the account level before creating the cluster
61
Q

Kibana + ElasticSearch

A
  • Kibana is an open-source data visualization and exploration tool used for log and time-series analytics, application monitoring, and operational intelligence use cases