Analysis Flashcards

1
Q

Amazon Machine Learning

A
  • Provides visualization tools and wizards to make creating a model easy
  • Fully managed
  • Now outdated (largely superseded by SageMaker)
2
Q

Amazon Machine Learning Cost Model

A
  • Charged for compute time
3
Q

Amazon Machine Learning Promises

A
  • No downtime
  • Up to 100GB training data
  • Up to 5 simultaneous jobs
4
Q

Amazon Machine Learning Anti Pattern

A
  • Terabyte-scale data
  • Unsupported learning tasks
    • sequence prediction
    • unsupervised clustering
    • deep learning
5
Q

AWS SageMaker

A
  • Build, train, and deploy models
  • TensorFlow, Apache MXNet
  • GPU-accelerated deep learning
  • Effectively unlimited scaling
  • Hyperparameter tuning jobs
6
Q

AWS SageMaker Security

A
  • Code stored in “ML storage volumes”
  • All artifacts encrypted in transit and at rest
  • API and console secured by SSL
  • KMS integration for SageMaker notebook, training jobs, endpoints
7
Q

Deep Learning on EC2 / EMR

A
  • EMR supports Apache MXNet and GPU instance types
  • Appropriate instance types for deep learning
    • P3: up to 8 Tesla V100 GPUs
    • P2: up to 16 K80 GPUs
    • G3: up to 4 M60 GPUs
  • Deep Learning AMIs
8
Q

AWS Data Pipeline

A
  • Manages task dependencies
  • Retries and notifies on failures
  • Highly available
  • Destinations: S3, RDS, DynamoDB, Redshift, EMR
9
Q

Kinesis Data Analytics

A
  • Fully managed and serverless
  • Transform and analyze streaming data in real time with Apache Flink
  • Reference tables provide an inexpensive way to join streaming data for quick lookups
  • Uses Flink under the hood
    • Flink is a framework for processing data streams
    • Kinesis Data Analytics integrates Flink with AWS
  • Use cases: continuous metric generation, responsive real-time analytics, etc.
  • 1 KPU = 1 vCPU and 4 GB memory
10
Q

Kinesis Data Analytics + Lambda

A
  • Post-processing
    • aggregating rows, translating to different formats, transforming and enriching data
11
Q

Kinesis Data Analytics Use Cases

A
  • Streaming ETL
  • Continuous metric generation
  • Responsive analysis
12
Q

RANDOM_CUT_FOREST

A
  • SQL function used for anomaly detection on numeric columns in a stream
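For illustration, a minimal sketch of how RANDOM_CUT_FOREST is invoked in a Kinesis Data Analytics SQL application; the "price" column and output stream names are hypothetical ("SOURCE_SQL_STREAM_001" is the default in-application input stream):

    -- Emit an anomaly score alongside each numeric value
    CREATE OR REPLACE STREAM "DEST_STREAM" ("price" DOUBLE, "ANOMALY_SCORE" DOUBLE);

    CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
      INSERT INTO "DEST_STREAM"
        SELECT STREAM "price", "ANOMALY_SCORE"
        FROM TABLE (
          RANDOM_CUT_FOREST(
            CURSOR (SELECT STREAM "price" FROM "SOURCE_SQL_STREAM_001")
          )
        );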
13
Q

Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)

A
  • A fork of Elasticsearch and Kibana
  • A search engine
  • Fully managed
  • Scale up and down without downtime
14
Q

OpenSearch Use Cases

A
  • Full text search
  • Log analytics
  • Application monitoring
  • Security analytics
  • Clickstream analytics
15
Q

OpenSearch Concepts

A
  • Documents
    • docs are hashed to a particular shard
  • Indices
    • An index has a primary shard and replicas (e.g., 2)
    • Applications should round-robin requests amongst nodes
  • Write requests are routed to primary shard, then replicated
  • Read requests are routed to primary or any replicas
16
Q

OpenSearch Options

A
  • Dedicated master node(s)
  • Choice of count and instance types
  • Domains
  • Zone Awareness
17
Q

OpenSearch Cold Warm UltraWarm Hot Storage

A
  • Standard data uses “hot” storage
    • instance stores or EBS volumes
  • UltraWarm (“warm”) storage uses S3 + caching
  • Cold storage
    • uses S3
    • requires a dedicated master node and UltraWarm enabled
  • Data may be migrated between storage types
18
Q

OpenSearch Index State Management

A
  • Automates index management policies
  • Example
    • delete old indices after a period of time
    • move indices from hot -> UltraWarm -> cold storage over time
    • automate index snapshots
  • ISM policies are run every 30-48 minutes
  • Index rollups
    • periodically roll up old data into summarized indices
    • saves storage costs
    • new index may have fewer fields, coarser time buckets
  • index transform
    • to create a different view to analyze data differently
    • groupings and aggregations
19
Q

OpenSearch Cross Cluster Replication

A
  • replicate indices / mappings / metadata across domains
  • replicate data geographically for better latency
  • “follower” index pulls data from “leader” index
    • With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
  • “remote reindex” allows copying indices from one cluster to another on demand
20
Q

OpenSearch Stability

A
  • 3 dedicated master nodes is best
    • avoids “split brain”
  • do not run out of disk space
    • minimum storage requirement is roughly: source data * (1 + number of replicas) * 1.45; e.g., 100 GB of source data with one replica needs about 100 * 2 * 1.45 = 290 GB
  • Choosing the number of shards
  • Choosing instance types
    • at least 3 nodes
    • mostly about storage requirements
21
Q

OpenSearch Security

A
  • resource-based policies
  • identity based policies
  • VPC
  • Cognito
22
Q

OpenSearch Anti Pattern

A
  • OLTP
  • ad-hoc data querying
  • OpenSearch is primarily for search and analytics
23
Q

OpenSearch Performance

A
  • memory pressure in the JVM can result if
    • unbalanced shard allocations across nodes
    • too many shards in a cluster
  • Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
    • delete old or unused indices
24
Q

Amazon Athena

A
  • serverless
  • interactive query service for S3 (SQL)
  • Presto under the hood
  • Supports many data formats
    • CSV, JSON, ORC, Parquet, Avro
  • unstructured, semi-structured, or structured data
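As a sketch (bucket name and schema are hypothetical), pointing Athena at raw CSV files in S3 is a single DDL statement; queries then run against the data in place:

    -- Hive-style external table over raw CSV files in S3
    CREATE EXTERNAL TABLE logs_csv (
      request_time string,
      status       int,
      uri          string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-bucket/logs/';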
25
Q

Amazon Athena Use Cases

A
  • ad-hoc queries of web logs
  • querying staging data before loading to Redshift
  • analyze CloudTrail / CloudFront / VPC logs in S3
  • integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
26
Q

Athena Workgroups

A
  • can organize users / teams / apps / workloads into WORKGROUPS
  • can control query access and track costs by Workgroups
  • Each workgroup has its own
    • query history
    • data limits
    • IAM policies
    • encryption settings
27
Q

Athena Cost Model

A
  • Pay as you go
    • $5 per TB scanned
    • successful or cancelled queries count; failed queries do not
    • No charge for DDL (CREATE/ALTER/DROP etc)
  • Save lots of money by using columnar formats
    • ORC, Parquet
    • save 30-90% and get better performance
28
Q

Athena Security

A
  • Transport Layer Security (TLS) encrypts in-transit between Athena and S3
29
Q

Athena Anti Pattern

A
  • Highly formatted reports / visualization
    • QuickSight better
  • ETL
    • use Glue instead
30
Q

Athena Optimized Performance

A
  • Use columnar data (ORC, Parquet)
  • a small number of large files performs better than a large number of small files
  • Use partitions
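A common way to get both the columnar-format and partitioning wins is a CTAS query; a minimal sketch, reusing the hypothetical logs_csv table and bucket from the Athena card above:

    -- Convert raw CSV to partitioned Parquet (partition columns go last in the SELECT)
    CREATE TABLE logs_parquet
    WITH (
      format            = 'PARQUET',
      external_location = 's3://my-bucket/logs-parquet/',
      partitioned_by    = ARRAY['status']
    ) AS
    SELECT request_time, uri, status
    FROM logs_csv;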
31
Q

Athena ACID transactions

A
  • Powered by Apache Iceberg
    • Just add 'table_type' = 'ICEBERG' in the CREATE TABLE statement
  • concurrent users can safely make row-level modifications
  • compatible with EMR, Spark, and anything else that supports the Iceberg format
  • removes the need for custom record locking
  • time travel operations
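A minimal sketch of the Iceberg opt-in (table, bucket, and the follow-up statements are hypothetical examples):

    -- 'table_type' = 'ICEBERG' turns on ACID semantics for this table
    CREATE TABLE events (
      id      bigint,
      ts      timestamp,
      payload string
    )
    LOCATION 's3://my-bucket/events/'
    TBLPROPERTIES ('table_type' = 'ICEBERG');

    -- Row-level modifications and time travel then work as plain SQL
    UPDATE events SET payload = '{}' WHERE id = 42;
    SELECT * FROM events FOR TIMESTAMP AS OF timestamp '2023-01-01 00:00:00 UTC';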
32
Q

Amazon Redshift

A
  • Fully managed, petabyte scale data warehouse
  • Designed for OLAP not OLTP
  • Cost effective
  • SQL, ODBC, JDBC interfaces
  • Scale up or down on demand
  • Built in replication and backups
  • Monitoring via CloudWatch / CloudTrail
  • Query exabytes of unstructured data in S3 without loading
  • limitless concurrency
  • Horizontal scaling
  • Separate compute and storage resources
  • Wide variety of data formats
  • Supports Gzip and Snappy compression
33
Q

Redshift Use Cases

A
  • Accelerate analytics workloads
  • Unified data warehouse and data lake
  • Data warehouse modernization
  • Analyze global sales data
  • Store historical stock trade data
  • Analyze ad impressions and clicks
  • Aggregate gaming data
  • Analyze social trends
34
Q

Redshift Performance

A
  • Massively Parallel Processing
  • Columnar Data Storage
  • Column Compression
35
Q

Redshift Durability

A
  • Replication within cluster
  • Backup to S3 (asynchronously replicated to another region)
  • Automated snapshots
  • Failed drives / nodes automatically replaced
  • However, limited to a single availability zone
36
Q

Redshift Scaling

A
  • vertical and horizontal scaling on demand
  • during scaling
    • a new cluster is created while your old one remains available for reads
    • CNAME is flipped to new cluster (a few mins of downtime)
    • data moved in parallel to new compute nodes
  • concurrency scaling
    • automatically adds cluster capacity to handle increase in concurrent read queries
    • supports virtually unlimited concurrent users and queries
37
Q

Redshift Distribution Styles

A
  • AUTO (Redshift figures it out based on size of data)
  • EVEN (rows distributed across slices in round-robin)
  • KEY (rows distributed based on one column)
  • ALL (entire table is copied to every node)
38
Q

Redshift Sort Key

A
  • rows are stored on disk in sorted order based on the column you designate as a sort key
  • like an index
  • makes for fast range queries
  • choosing a sort key
    • single vs compound vs interleaved
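Putting the last two cards together, distribution style and sort key are both declared at table creation; a sketch with hypothetical table and column names:

    -- KEY distribution co-locates rows sharing a customer_id on the same slice;
    -- the compound sort key makes range scans on sale_date fast
    CREATE TABLE sales (
      sale_id     bigint,
      customer_id bigint,
      sale_date   date,
      amount      decimal(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (sale_date);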
39
Q

Redshift Importing Exporting Data

A
  • COPY command
    • parallelized and efficient
    • from s3, emr, DynamoDB, remote host
    • S3 requires a manifest file and IAM role
  • UNLOAD command
    • unload from a table into files in S3
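A sketch of both directions (bucket, manifest, and role ARN are hypothetical):

    -- Load from S3 using a manifest file and an IAM role
    COPY sales
    FROM 's3://my-bucket/load/sales.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    MANIFEST;

    -- Unload query results back to S3 as Parquet
    UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
    TO 's3://my-bucket/unload/sales_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;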
40
Q

Redshift COPY Command

A
  • Use COPY to load large amounts of data from outside of Redshift
  • If your data is already in Redshift in another table,
    • use INSERT INTO … SELECT
    • or CREATE TABLE AS
  • COPY can decrypt data as it is loaded from S3
    • hardware-accelerated SSL used to keep it fast
  • gzip, lzop and bzip2 compression supported to speed it up further
  • automatic compression option
    • analyzes the data and figures out the optimal compression scheme for storing it
  • Special use case: narrow tables (lots of rows, few columns)
    • load with a single COPY transaction if possible
    • otherwise hidden metadata columns consume too much space
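The in-Redshift alternatives and the compression options above, sketched with hypothetical names:

    -- Data already in Redshift: no COPY needed
    INSERT INTO sales_2023 SELECT * FROM sales WHERE sale_date >= '2023-01-01';
    CREATE TABLE sales_copy AS SELECT * FROM sales;

    -- Load gzip-compressed files, letting Redshift pick column encodings
    COPY sales
    FROM 's3://my-bucket/load/sales.csv.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    GZIP
    COMPUPDATE ON;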
41
Q

Redshift DBLINK

A
  • Connect Redshift to PostgreSQL
  • Good way to copy and sync data between PostgreSQL and Redshift
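A sketch from the PostgreSQL side using the dblink extension (connection string, credentials, and query are hypothetical):

    -- Run from PostgreSQL; requires the dblink extension
    CREATE EXTENSION IF NOT EXISTS dblink;

    SELECT *
    FROM dblink(
      'host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=awsuser password=secret',
      'SELECT sale_id, amount FROM sales'
    ) AS t(sale_id bigint, amount numeric);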
42
Q

Redshift Workload Management

A
  • Prioritize short, fast queries vs long, slow queries
  • Creates up to 8 queues
    • default 5 queues with even memory allocation
  • configuring query queue
    • priority
    • concurrency scaling mode
    • user groups
    • query groups
    • query monitoring rules
43
Q

Redshift Manual Workload Management

A
  • One default queue with concurrency level of 5 (5 queries at once)
  • Superuser queue with concurrency level 1
  • Define up to 8 queues, up to concurrency level 50
44
Q

Redshift Short Query Acceleration (SQA)

A
  • Prioritize short-running queries over long running ones
  • Short queries run in a dedicated space, won’t wait in queue behind long queries
  • Can be used in place of WLM queues for short queries
  • can configure how many seconds counts as “short”
45
Q

Redshift Resizing Clusters

A
  • Elastic Resize
    • quickly add or remove nodes of same type
    • cluster is down for a few mins
  • Classic Resize
    • change node type or number of nodes
    • cluster is read-only for hours to days
  • Snapshot, restore, resize
    • used to keep cluster available during a classic resize
46
Q

Redshift VACUUM

A
  • recovers space from deleted rows
  • VACUUM FULL
    • Sorts the specified table and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
  • VACUUM DELETE ONLY
    • Reclaims disk space without sorting
  • VACUUM SORT ONLY
    • Sort specified table without reclaiming disk space
  • VACUUM REINDEX
    • Analyzes the distribution of values in interleaved sort key columns, then performs a full VACUUM
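The variants side by side (table name hypothetical):

    VACUUM FULL sales;        -- sort + reclaim space (the default)
    VACUUM DELETE ONLY sales; -- reclaim space, skip the sort
    VACUUM SORT ONLY sales;   -- sort, skip the reclaim
    VACUUM REINDEX sales;     -- reanalyze interleaved sort keys, then full vacuum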
47
Q

Redshift New Features

A
  • RA3 nodes with managed storage
    • enable independent scaling of compute and storage
    • SSD-based
  • redshift data lake export
    • unload Redshift query results to S3 in Apache Parquet format
    • Parquet is 2x faster to unload and consumes up to 6x less storage
  • spatial data types
48
Q

Redshift AQUA

A
  • Advanced query accelerator
  • pushes reduction and aggregation queries closer to the data
  • up to 10x faster, no extra cost, no code changes
  • benefits from high-bandwidth connection to s3
49
Q

Redshift Anti Pattern

A
  • small data sets
  • OLTP
  • unstructured data
  • BLOB data
50
Q

Redshift Security

A
  • Using a Hardware Security Module (HSM)
    • must use a client and server certificate to configure a trusted connection between Redshift and HSM
51
Q

Redshift Serverless

A
  • Automatic scaling and provisioning for your workload
  • Optimizes costs and performance
  • Uses ML to maintain performance across variable and sporadic workloads
  • Easy to spin up dev and test environments
  • Easy ad-hoc business analysis
52
Q

Redshift Monitoring

A
  • Monitoring views
    • SYS_QUERY_HISTORY
    • SYS_LOAD_HISTORY
    • SYS_SERVERLESS_USAGE
  • CloudWatch logs
  • CloudWatch metrics
53
Q

Amazon RDS

A
  • Hosted relational database
    • Aurora, MySQL, PostgreSQL, Oracle, etc
  • Not for big data
54
Q

RDS ACID

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
55
Q

Amazon Aurora

A
  • MySQL and PostgreSQL compatible
  • up to 5x faster than MySQL, 3x faster than PostgreSQL
  • 1/10 the cost of commercial databases
  • Up to 64TB per database instance
  • Up to 15 read replicas
  • Continuous backup to s3
  • Replication across availability zones
  • Automatic scaling with Aurora Serverless
56
Q

Aurora Security

A
  • VPC
  • Encryption at rest: KMS
  • Encryption in flight: SSL
57
Q

Amazon QuickSight

A
  • Business analytics service
  • allows all users to
    • build visualizations
    • perform ad-hoc analysis
    • quickly get business insights from data
  • serverless
58
Q

QuickSight SPICE

A
  • Data sets are imported into SPICE
    • super-fast, parallel, in-memory calculation engine
    • uses columnar storage, in-memory processing, machine code generation
    • accelerates interactive queries on large data sets
  • each user gets 10GB of SPICE
  • highly available and durable
  • scales to hundreds of thousands of users
59
Q

QuickSight Use Cases

A
  • Interactive ad-hoc exploration / visualization of data
  • dashboards and KPIs
  • Analyze / visualize data from
    • logs in s3
    • on-premises databases
    • AWS (RDS, Redshift, Athena, S3)
    • SaaS applications such as Salesforce
60
Q

QuickSight Anti Pattern

A
  • highly formatted canned reports
  • ETL
61
Q

QuickSight Security

A
  • VPC
  • Multi-Factor Authentication
  • Row-level security
  • Column-level security (Enterprise edition only)
62
Q

QuickSight + Redshift Security

A
  • By default, QuickSight can only access data stored in the same region as QuickSight itself is running in
  • Problem : QuickSight in region A, Redshift in region B
  • Solution : create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
63
Q

QuickSight User Management

A
  • Users defined via IAM or email signup
  • Active Directory connector with QuickSight Enterprise Edition
64
Q

QuickSight Pricing

A
  • Annual Subscription
    • Standard: $9 / month / user
    • Enterprise: $18 / month / user
  • Extra SPICE capacity
    • $0.25 (Standard) / $0.38 (Enterprise) per GB per user per month
65
Q

QuickSight Dashboards

A
  • read-only snapshots of an analysis
  • can share with others with QuickSight access
  • can share even more widely with embedded dashboards
    • embed within an application
66
Q

QuickSight Machine Learning Insights

A
  • ML powered anomaly detection
  • ML powered forecasting
  • Autonarratives