Analysis Flashcards

1
Q

Amazon Machine Learning

A
  • Provides visualization tools and wizards to make creating a model easy
  • Fully managed
  • Now outdated (legacy service, superseded by SageMaker)
2
Q

Amazon Machine Learning Cost Model

A
  • Charged for compute time
3
Q

Amazon Machine Learning Promises

A
  • No downtime
  • Up to 100GB training data
  • Up to 5 simultaneous jobs
4
Q

Amazon Machine Learning Anti Pattern

A
  • Terabyte-scale data
  • Unsupported learning tasks
    • sequence prediction
    • unsupervised clustering
    • deep learning
5
Q

AWS SageMaker

A
  • Build, train, and deploy models
  • TensorFlow, Apache MXNet
  • GPU-accelerated deep learning
  • Effectively unlimited scaling
  • Hyperparameter tuning jobs
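
A minimal sketch of the build/train flow using the sagemaker Python SDK; the container image URI, IAM role, S3 paths, and instance choice below are hypothetical placeholders, not the course's own example.

```python
# Minimal sketch: launching a SageMaker training job with the sagemaker Python SDK.
# The image URI, IAM role, and S3 paths are hypothetical placeholders.
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # hypothetical
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",                       # hypothetical
    instance_count=1,
    instance_type="ml.p3.2xlarge",  # GPU instance for deep learning
    output_path="s3://my-bucket/models/",                                                # hypothetical
    sagemaker_session=sagemaker.Session(),
)

# SageMaker provisions the instances, trains against data staged in S3,
# writes the model artifact to output_path, and tears everything down.
estimator.fit({"train": "s3://my-bucket/train/"})
```
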
6
Q

AWS SageMaker Security

A
  • Code stored in “ML storage volumes”
  • All artifacts encrypted in transit and at rest
  • API and console secured by SSL
  • KMS integration for SageMaker notebook, training jobs, endpoints
7
Q

Deep Learning on EC2 / EMR

A
  • EMR supports Apache MXNet and GPU instance types
  • Appropriate instance types for deep learning
    • P3: 8 Tesla V100 GPUs
    • P2: 16 K80 GPUs
    • G3: 4 M60 GPUs
  • Deep Learning AMIs
8
Q

AWS Data Pipeline

A
  • Manages task dependencies
  • Retries and notifies on failures
  • Highly available
  • Destination : S3, RDS, DynamoDB, Redshift, EMR
9
Q

Kinesis Data Analytics

A
  • Fully managed and serverless
  • Transform and analyze streaming data in real time with Apache Flink
  • Reference tables provide an inexpensive way to join streaming data for quick lookups
  • Uses Flink under the hood
    • Flink is a framework for processing data streams
    • Kinesis Data Analytics integrates Flink with AWS
  • Use cases: continuous metric generation, responsive real-time analytics, etc
  • 1 KPU = 1 vCPU and 4GB memory
10
Q

Kinesis Data Analytics + Lambda

A
  • Post-processing
    • aggregating rows, translating to different formats, transforming and enriching data (see the sketch below)
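
One common variant is delivering the analytics output to a Kinesis stream and post-processing it with a Lambda trigger; this sketch assumes that pattern (the standard Kinesis event-source shape) rather than the direct Lambda-output integration, and the enrichment is just an example.

```python
# Minimal sketch: a Lambda post-processor consuming analytics output delivered
# to a Kinesis stream (standard Kinesis event source mapping shape).
import base64
import json

def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        payload["enriched"] = True  # example: transform / enrich / reformat the row
        # ... write the enriched record to its destination (e.g., S3, DynamoDB) ...
        print(json.dumps(payload))
    return {"processed": len(event["Records"])}
```
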
11
Q

Kinesis Data Analytics Use Cases

A
  • Streaming ETL
  • Continuous metric generation
  • Responsive analysis
12
Q

RANDOM_CUT_FOREST

A
  • SQL function used for anomaly detection on numeric columns in a stream
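
A hedged sketch of how the function is typically invoked inside a Kinesis Data Analytics SQL application, shown here as application code held in a Python string; the stream and column names are hypothetical.

```python
# Hypothetical Kinesis Data Analytics (SQL) application code that scores each
# record on the in-application input stream with RANDOM_CUT_FOREST.
ANOMALY_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "sensor_id" VARCHAR(16), "reading" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "sensor_id", "reading", "ANOMALY_SCORE"
FROM TABLE(RANDOM_CUT_FOREST(
    CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")));
"""
```
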
13
Q

Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)

A
  • A fork of Elasticsearch and Kibana
  • A search engine
  • Fully managed
  • Scale up and down without downtime
14
Q

OpenSearch Use Cases

A
  • Full text search
  • Log analytics
  • Application monitoring
  • Security analytics
  • Clickstream analytics
15
Q

OpenSearch Concepts

A
  • Documents
    • docs are hashed to a particular shard
  • Indices
    • an index has primary shards plus replicas (e.g., 2 replicas)
    • applications should round-robin requests amongst nodes
  • Write requests are routed to the primary shard, then replicated
  • Read requests are routed to the primary shard or any replica
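
To make the shard/replica terminology concrete, here is a minimal sketch using the opensearch-py client; the domain endpoint, credentials, index name, and shard/replica counts are hypothetical.

```python
# Minimal sketch: creating an index with explicit shard/replica counts via opensearch-py.
# The endpoint and credentials are hypothetical placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # hypothetical
    http_auth=("admin", "admin-password"),                                   # hypothetical
    use_ssl=True,
)

# One primary shard plus two replicas, per the card above; documents are hashed to a shard.
client.indices.create(
    index="web-logs",
    body={"settings": {"number_of_shards": 1, "number_of_replicas": 2}},
)
client.index(index="web-logs", body={"status": 200, "path": "/home"})
```
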
16
Q

OpenSearch Options

A
  • Dedicated master node(s)
  • Choice of count and instance types
  • Domains
  • Zone Awareness
17
Q

OpenSearch Storage Tiers (Hot / UltraWarm / Cold)

A
  • Standard data nodes use "hot" storage
    • instance stores or EBS volumes
  • UltraWarm ("warm") storage uses S3 + caching
  • Cold storage
    • uses S3
    • must have a dedicated master node and UltraWarm enabled
  • Data may be migrated between different storage types
18
Q

OpenSearch Index State Management

A
  • Automates index management policies
  • Examples
    • delete old indices after a period of time
    • move indices from hot -> UltraWarm -> cold storage over time
    • automate index snapshots
  • ISM policies are run every 30-48 minutes
  • Index rollups
    • periodically roll up old data into summarized indices
    • saves storage costs
    • new index may have fewer fields, coarser time buckets
  • Index transforms
    • create a different view to analyze data differently
    • groupings and aggregations
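
A minimal sketch of an ISM policy (expressed here as a Python dict) that deletes indices once they are older than 30 days; the state and transition names follow the ISM plugin's policy schema as I understand it, and the age threshold is just an example.

```python
# Hypothetical ISM policy: keep indices in a "hot" state, then delete them
# once they are older than 30 days.
ism_policy = {
    "policy": {
        "description": "Delete old indices after 30 days",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {"state_name": "delete", "conditions": {"min_index_age": "30d"}}
                ],
            },
            {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
        ],
    }
}
```
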
19
Q

OpenSearch Cross Cluster Replication

A
  • replicate indices / mappings / metadata across domains
  • replicate data geographically for better latency
  • “follower” index pulls data from “leader” index
    • With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
  • “remote reindex” allows copying indices from one cluster to another on demand
20
Q

OpenSearch Stability

A
  • 3 dedicated master nodes is best
    • avoids “split brain”
  • do not run out of disk space
    • minimum storage requirement is roughly : source data * (1 + num of replicas) * 1.45
  • Choosing the number of shards
  • Choosing instance types
    • at least 3 nodes
    • mostly about storage requirements
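
A quick worked example of the rough sizing formula above, with hypothetical numbers.

```python
# Rough minimum storage for 100 GB of source data with 2 replicas:
source_data_gb = 100
replicas = 2
min_storage_gb = source_data_gb * (1 + replicas) * 1.45  # = 435 GB
```
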
21
Q

OpenSearch Security

A
  • resource-based policies
  • identity based policies
  • VPC
  • Cognito
22
Q

OpenSearch Anti Pattern

A
  • OLTP
  • ad-hoc data querying
  • OpenSearch is primarily for search and analytics
23
Q

OpenSearch Performance

A
  • memory pressure in the JVM can result if
    • unbalanced shard allocations across nodes
    • too many shards in a cluster
  • Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
    • delete old or unused indices
24
Q

Amazon Athena

A
  • Serverless
  • Interactive SQL query service for S3
  • Presto under the hood
  • Supports many data formats
    • CSV, JSON, ORC, Parquet, Avro
  • Unstructured, semi-structured, or structured data
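
A minimal sketch of an ad-hoc query against S3 data using the boto3 Athena client; the database, table, and results bucket are hypothetical.

```python
# Minimal sketch: running an ad-hoc SQL query over data in S3 with the boto3 Athena client.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",  # hypothetical table
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},     # hypothetical bucket
)
query_id = response["QueryExecutionId"]
# Poll get_query_execution() until the query finishes, then fetch rows:
# athena.get_query_results(QueryExecutionId=query_id)
```
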
25
Q

Amazon Athena Use Cases

A
  • Ad-hoc queries of web logs
  • Querying staging data before loading to Redshift
  • Analyze CloudTrail / CloudFront / VPC logs in S3
  • Integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
26
Q

Athena Workgroups

A
  • Can organize users / teams / apps / workloads into workgroups
  • Can control query access and track costs by workgroup
  • Each workgroup has its own
    • query history
    • data limits
    • IAM policies
    • encryption settings
27
Q

Athena Cost Model

A
  • Pay as you go
    • $5 per TB scanned
    • successful or cancelled queries count; failed queries do not
    • no charge for DDL (CREATE / ALTER / DROP etc)
  • Save lots of money by using columnar formats
    • ORC, Parquet
    • save 30-90% and get better performance
28
Q

Athena Security

A
  • Transport Layer Security (TLS) encrypts data in transit between Athena and S3
29
Q

Athena Anti Patterns

A
  • Highly formatted reports / visualization
    • QuickSight is better
  • ETL
    • use Glue instead
30
Q

Athena Performance Optimization

A
  • Use columnar data (ORC, Parquet)
  • A small number of large files performs better than a large number of small files
  • Use partitions
31
Q

Athena ACID Transactions

A
  • Powered by Apache Iceberg
    • just add 'table_type' = 'ICEBERG' in the CREATE TABLE statement (see the sketch below)
  • Concurrent users can safely make row-level modifications
  • Compatible with EMR, Spark, anything that supports the Iceberg format
  • Removes the need for custom record locking
  • Time travel operations
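
A minimal sketch of creating an Iceberg-backed table through Athena; the table name, columns, and S3 locations are hypothetical.

```python
# Minimal sketch: creating an Iceberg table in Athena so it supports ACID /
# row-level modifications. Table name, columns, and S3 paths are hypothetical.
import boto3

athena = boto3.client("athena")

CREATE_ICEBERG_TABLE = """
CREATE TABLE orders (
    order_id STRING,
    amount   DOUBLE,
    ts       TIMESTAMP
)
LOCATION 's3://my-bucket/iceberg/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""

athena.start_query_execution(
    QueryString=CREATE_ICEBERG_TABLE,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical
)
```
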
32
Q

Amazon Redshift

A
  • Fully managed, petabyte-scale data warehouse
  • Designed for OLAP, not OLTP
  • Cost effective
  • SQL, ODBC, JDBC interfaces
  • Scale up or down on demand
  • Built-in replication and backups
  • Monitoring via CloudWatch / CloudTrail
  • Query exabytes of unstructured data in S3 without loading
  • Limitless concurrency
  • Horizontal scaling
  • Separate compute and storage resources
  • Wide variety of data formats
  • Support for gzip and Snappy compression
33
Q

Redshift Use Cases

A
  • Accelerate analytics workloads
  • Unified data warehouse and data lake
  • Data warehouse modernization
  • Analyze global sales data
  • Store historical stock trade data
  • Analyze ad impressions and clicks
  • Aggregate gaming data
  • Analyze social trends
34
Q

Redshift Performance

A
  • Massively Parallel Processing (MPP)
  • Columnar data storage
  • Column compression
35
Q

Redshift Durability

A
  • Replication within the cluster
  • Backup to S3 (asynchronously replicated to another region)
  • Automated snapshots
  • Failed drives / nodes automatically replaced
  • However, limited to a single availability zone
36
Q

Redshift Scaling

A
  • Vertical and horizontal scaling on demand
  • During scaling
    • a new cluster is created while your old one remains available for reads
    • CNAME is flipped to the new cluster (a few minutes of downtime)
    • data moved in parallel to the new compute nodes
  • Concurrency scaling
    • automatically adds cluster capacity to handle increases in concurrent read queries
    • supports virtually unlimited concurrent users and queries
37
Q

Redshift Distribution Styles

A
  • AUTO (Redshift figures it out based on size of data)
  • EVEN (rows distributed across slices in round-robin)
  • KEY (rows distributed based on one column)
  • ALL (entire table is copied to every node)
38
Q

Redshift Sort Key

A
  • Rows are stored on disk in sorted order based on the column you designate as a sort key
  • Like an index
  • Makes for fast range queries (see the DDL sketch below)
  • Choosing a sort key
    • single vs compound vs interleaved
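
A minimal DDL sketch combining a KEY distribution style with a sort key, submitted through the boto3 Redshift Data API; the cluster, database, table, and column names are hypothetical.

```python
# Minimal sketch: Redshift DDL with an explicit distribution key and sort key,
# submitted via the Redshift Data API. Identifiers are hypothetical.
import boto3

redshift_data = boto3.client("redshift-data")

DDL = """
CREATE TABLE sales (
    sale_id   BIGINT,
    customer  VARCHAR(64),
    sale_date DATE,
    amount    DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY (customer)
SORTKEY (sale_date);
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",  # hypothetical
    Database="dev",
    DbUser="awsuser",                # hypothetical
    Sql=DDL,
)
```
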
39
Q

Redshift Importing / Exporting Data

A
  • COPY command
    • parallelized and efficient
    • from S3, EMR, DynamoDB, remote hosts
    • S3 requires a manifest file and IAM role
  • UNLOAD command
    • unload from a table into files in S3
40
Q

Redshift COPY Command

A
  • Use COPY to load large amounts of data from outside of Redshift (see the example below)
  • If your data is already in Redshift in another table
    • use INSERT INTO ... SELECT
    • or CREATE TABLE AS
  • COPY can decrypt data as it is loaded from S3
    • hardware-accelerated SSL used to keep it fast
  • gzip, lzop and bzip2 compression supported to speed it up further
  • Automatic compression option
    • analyzes the data and figures out the optimal compression scheme for storing it
  • Special use case: narrow tables (lots of rows, few columns)
    • load with a single COPY transaction if possible
    • otherwise hidden metadata columns consume too much space
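
A minimal sketch of a COPY statement loading gzip-compressed CSV from S3 via a manifest file and IAM role; the bucket, manifest path, and role ARN are hypothetical, and it could be submitted through the Redshift Data API as in the earlier sketch.

```python
# Hypothetical COPY statement: parallel load of gzip-compressed CSV files listed
# in a manifest, authorized by an IAM role attached to the cluster.
COPY_SQL = """
COPY sales
FROM 's3://my-bucket/sales/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
GZIP
CSV;
"""
```
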
41
Q

Redshift DBLINK

A
  • Connect Redshift to PostgreSQL
  • Good way to copy and sync data between PostgreSQL and Redshift
42
Q

Redshift Workload Management (WLM)

A
  • Prioritize short, fast queries vs long, slow queries
  • Creates up to 8 queues
    • default of 5 queues with even memory allocation
  • Configuring query queues
    • priority
    • concurrency scaling mode
    • user groups
    • query groups
    • query monitoring rules
43
Q

Redshift Manual Workload Management

A
  • One default queue with concurrency level of 5 (5 queries at once)
  • Superuser queue with concurrency level 1
  • Define up to 8 queues, up to concurrency level 50
44
Q

Redshift Short Query Acceleration (SQA)

A
  • Prioritize short-running queries over long-running ones
  • Short queries run in a dedicated space and won't wait in a queue behind long queries
  • Can be used in place of WLM queues for short queries
  • Can configure how many seconds counts as "short"
45
Q

Redshift Resizing Clusters

A
  • Elastic resize
    • quickly add or remove nodes of the same type
    • cluster is down for a few minutes
  • Classic resize
    • change node type and/or number of nodes
    • cluster is read-only for hours to days
  • Snapshot, restore, resize
    • used to keep the cluster available during a classic resize
46
Q

Redshift VACUUM

A
  • Recovers space from deleted rows
  • VACUUM FULL
    • sorts the specified table and reclaims disk space occupied by rows marked for deletion by previous UPDATE and DELETE operations
  • VACUUM DELETE ONLY
    • reclaims disk space without sorting
  • VACUUM SORT ONLY
    • sorts the specified table without reclaiming disk space
  • VACUUM REINDEX
    • analyzes the distribution of sort keys, then performs a full VACUUM
47
Q

Redshift New Features

A
  • RA3 nodes with managed storage
    • enable independent scaling of compute and storage
    • SSD-based
  • Redshift data lake export
    • unload a Redshift query to S3 in Apache Parquet format (see the UNLOAD sketch below)
    • Parquet is 2x faster to unload and consumes up to 6x less storage
  • Spatial data types
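
A minimal sketch of a data lake export: unloading a query result to S3 as Parquet. The query, S3 prefix, and role ARN are hypothetical.

```python
# Hypothetical UNLOAD statement writing query results to S3 in Apache Parquet format.
UNLOAD_SQL = """
UNLOAD ('SELECT customer, SUM(amount) AS total FROM sales GROUP BY customer')
TO 's3://my-bucket/exports/sales_totals_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""
```
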
48
Q

Redshift AQUA

A
  • Advanced Query Accelerator
  • Pushes reduction and aggregation queries closer to the data
  • Up to 10x faster, no extra cost, no code changes
  • Benefits from high-bandwidth connection to S3
49
Q

Redshift Anti Patterns

A
  • Small data sets
  • OLTP
  • Unstructured data
  • BLOB data
50
Q

Redshift Security

A
  • Using a Hardware Security Module (HSM)
    • must use a client and server certificate to configure a trusted connection between Redshift and the HSM
51
Q

Redshift Serverless

A
  • Automatic scaling and provisioning for your workload
  • Optimizes costs and performance
  • Uses ML to maintain performance across variable and sporadic workloads
  • Easy to spin up dev and test environments
  • Easy ad-hoc business analysis
52
Q

Redshift Monitoring

A
  • Monitoring views
    • SYS_QUERY_HISTORY
    • SYS_LOAD_HISTORY
    • SYS_SERVERLESS_USAGE
  • CloudWatch logs
  • CloudWatch metrics
53
Q

Amazon RDS

A
  • Hosted relational database
    • Aurora, MySQL, PostgreSQL, Oracle, etc
  • Not for big data
54
Q

RDS ACID

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
55
Q

Amazon Aurora

A
  • MySQL- and PostgreSQL-compatible
  • Up to 5x faster than MySQL, 3x faster than PostgreSQL
  • 1/10 the cost of commercial databases
  • Up to 64TB per database instance
  • Up to 15 read replicas
  • Continuous backup to S3
  • Replication across availability zones
  • Automatic scaling with Aurora Serverless
56
Q

Aurora Security

A
  • VPC
  • Encryption at rest: KMS
  • Encryption in flight: SSL
57
Q

Amazon QuickSight

A
  • Business analytics service
  • Allows all users to
    • build visualizations
    • perform ad-hoc analysis
    • quickly get business insights from data
  • Serverless
58
Q

QuickSight SPICE

A
  • Data sets are imported into SPICE
    • Super-fast, Parallel, In-memory Calculation Engine
  • Uses columnar storage, in-memory processing, and machine code generation
  • Accelerates interactive queries on large data sets
  • Each user gets 10GB of SPICE
  • Highly available and durable
  • Scales to hundreds of thousands of users
59
Q

QuickSight Use Cases

A
  • Interactive ad-hoc exploration / visualization of data
  • Dashboards and KPIs
  • Analyze / visualize data from
    • logs in S3
    • on-premises databases
    • AWS (RDS, Redshift, Athena, S3)
    • SaaS applications such as Salesforce
60
Q

QuickSight Anti Patterns

A
  • Highly formatted canned reports
  • ETL
61
Q

QuickSight Security

A
  • VPC
  • Multi-Factor Authentication
  • Row-level security
  • Column-level security (Enterprise edition only)
62
Q

QuickSight + Redshift Security

A
  • By default, QuickSight can only access data stored in the same region as the one it is running in
  • Problem: QuickSight in region A, Redshift in region B
  • Solution: create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region (see the sketch below)
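
A minimal sketch of that solution using boto3/EC2: open the Redshift port to the QuickSight service IP range for the relevant region. The security group ID and CIDR below are hypothetical placeholders.

```python
# Minimal sketch: authorizing a (hypothetical) QuickSight service IP range to
# reach the Redshift cluster's security group on the default Redshift port.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical Redshift security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,            # default Redshift port
        "ToPort": 5439,
        "IpRanges": [{
            "CidrIp": "52.23.63.224/27",  # hypothetical QuickSight IP range for the region
            "Description": "QuickSight access",
        }],
    }],
)
```
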
63
Q

QuickSight User Management

A
  • Users defined via IAM or email signup
  • Active Directory connector with QuickSight Enterprise Edition
64
Q

QuickSight Pricing

A
  • Annual subscription
    • Standard: $9 / month / user
    • Enterprise: $18 / month / user
  • Extra SPICE capacity
    • $0.25 (Standard) or $0.38 (Enterprise) / GB / user / month
65
Q

QuickSight Dashboards

A
  • Read-only snapshots of an analysis
  • Can share with others who have QuickSight access
  • Can share even more widely with embedded dashboards
    • embed within an application (see the sketch below)
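
A hedged sketch of generating an embed URL for a registered user with the boto3 QuickSight client; the account ID, user ARN, and dashboard ID are hypothetical, and the returned EmbedUrl is what an application would load in an iframe.

```python
# Minimal sketch: generating an embeddable dashboard URL for a registered
# QuickSight user. Account ID, user ARN, and dashboard ID are hypothetical.
import boto3

quicksight = boto3.client("quicksight")

response = quicksight.generate_embed_url_for_registered_user(
    AwsAccountId="123456789012",                                              # hypothetical
    UserArn="arn:aws:quicksight:us-east-1:123456789012:user/default/analyst", # hypothetical
    ExperienceConfiguration={"Dashboard": {"InitialDashboardId": "my-dashboard-id"}},
    SessionLifetimeInMinutes=60,
)
embed_url = response["EmbedUrl"]  # load this URL in an iframe inside the application
```
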
66
Q

QuickSight Machine Learning Insights

A
  • ML-powered anomaly detection
  • ML-powered forecasting
  • Autonarratives