Analysis Flashcards
1
Q
Amazon Machine Learning
A
- Provides visualization tools and wizards to make creating a model easy
- Fully managed
- Outdated; superseded by SageMaker
2
Q
Amazon Machine Learning Cost Model
A
- Charged for compute time
3
Q
Amazon Machine Learning Promises
A
- No downtime
- Up to 100GB training data
- Up to 5 simultaneous jobs
4
Q
Amazon Machine Learning Anti Pattern
A
- Terabyte-scale data
- Unsupported learning tasks
- sequence prediction
- unsupervised clustering
- deep learning
5
Q
Amazon SageMaker
A
- Build, Train and Deploy models
- TensorFlow, Apache MXNet
- GPU accelerated deep learning
- Scaling effectively unlimited
- hyperparameter tuning jobs
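As a concrete illustration of the build/train workflow above, a minimal boto3 sketch that launches a SageMaker training job; the job name, role ARN, image URI, and buckets are all placeholders, not values from this deck:

```python
import boto3

sm = boto3.client("sagemaker")

# All names, ARNs, and URIs below are placeholders.
sm.create_training_job(
    TrainingJobName="demo-training-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    # GPU instance type for deep learning workloads
    ResourceConfig={"InstanceType": "ml.p3.2xlarge",
                    "InstanceCount": 1,
                    "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```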
6
Q
Amazon SageMaker Security
A
- Code stored in “ML storage volumes”
- All artifacts encrypted in transit and at rest
- API and console secured by SSL
- KMS integration for SageMaker notebook, training jobs, endpoints
7
Q
Deep Learning on EC2 / EMR
A
- EMR supports Apache MXNet and GPU instance types
- Appropriate instance types for deep learning
- P3 : 8 Tesla V100 GPUs
- P2 : 16 K80 GPUs
- G3 : 4 M60 GPUs
- Deep Learning AMIs
8
Q
AWS Data Pipeline
A
- Manages task dependencies
- Retries and notifies on failures
- Highly available
- Destinations : S3, RDS, DynamoDB, Redshift, EMR
9
Q
Kinesis Data Analytics
A
- Fully managed and serverless
- Transform and analyze streaming data in real time with Apache Flink
- Reference tables provide an inexpensive way to join streaming data for quick lookups
- Uses Flink under the hood
- Flink is a framework for processing data streams
- Kinesis Data Analytics integrates Flink with AWS
- Use Cases : Continuous metric generation, responsive real-time analytics, etc
- 1KPU = 1 vCPU and 4GB memory
10
Q
Kinesis Data Analytics + Lambda
A
- Post-processing
- aggregating rows, translating to different formats, transforming and enriching data
11
Q
Kinesis Data Analytics Use Cases
A
- Streaming ETL
- Continuous metric generation
- Responsive analysis
12
Q
RANDOM_CUT_FOREST
A
- SQL function used for anomaly detection on numeric columns in a stream
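A sketch of how RANDOM_CUT_FOREST is typically used: the documented SQL pattern embedded in a Python string and deployed through the legacy SQL-based Kinesis Data Analytics API. Stream, column, and application names are hypothetical, and the input/output stream mappings are omitted for brevity:

```python
import boto3

# RANDOM_CUT_FOREST emits an ANOMALY_SCORE for each record in the
# numeric column(s) it is given.
application_code = """
CREATE OR REPLACE STREAM "DEST_STREAM" ("value" DOUBLE, "ANOMALY_SCORE" DOUBLE);
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "DEST_STREAM"
  SELECT STREAM "value", "ANOMALY_SCORE"
  FROM TABLE(RANDOM_CUT_FOREST(
    CURSOR(SELECT STREAM "value" FROM "SOURCE_SQL_STREAM_001")
  ));
"""

kda = boto3.client("kinesisanalytics")
kda.create_application(
    ApplicationName="anomaly-detector",  # hypothetical
    ApplicationCode=application_code,
    # Inputs/Outputs (Kinesis stream mappings) omitted for brevity.
)
```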
13
Q
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
A
- A fork of Elasticsearch and Kibana
- A search engine
- Fully managed
- Scale up and down without downtime
14
Q
OpenSearch Use Cases
A
- Full text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
15
Q
OpenSearch Concepts
A
- Documents
- docs are hashed to a particular shard
- Indices
- An index has a primary shard and two replicas
- Applications should round-robin requests amongst nodes
- Write requests are routed to the primary shard, then replicated
- Read requests are routed to the primary shard or any replica
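A minimal sketch of these read/write paths against a domain's REST API using Python's requests library; the endpoint is a placeholder, and real domains also need an access policy and/or authenticated (e.g., SigV4-signed) requests:

```python
import requests

# Placeholder endpoint -- replace with your domain's endpoint.
endpoint = "https://my-domain.us-east-1.es.amazonaws.com"

# Index (write) a document: routed to a primary shard, then replicated.
doc = {"title": "hello", "views": 42}
resp = requests.put(f"{endpoint}/my-index/_doc/1", json=doc)
print(resp.status_code)

# Read it back: may be served by the primary shard or any replica.
resp = requests.get(f"{endpoint}/my-index/_doc/1")
print(resp.json()["_source"])
```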
16
Q
OpenSearch Options
A
- Dedicated master node(s)
- Choice of count and instance types
- Domains
- Zone Awareness
17
Q
OpenSearch Storage Tiers : Hot, UltraWarm, Cold
A
- Standard data nodes use “hot” storage
- instance stores or EBS volumes
- UltraWarm (“warm”) storage uses S3 + caching
- Cold storage
- Uses S3
- Requires a dedicated master node and UltraWarm enabled
- Data may be migrated between different storage types
18
Q
OpenSearch Index State Management
A
- Automates index management policies
- Example
- delete old indices after a period of time
- move indices from hot -> UltraWarm -> cold storage over time
- Automate index snapshots
- ISM policies are run every 30-48 minutes
- Index rollups
- periodically roll up old data into summarized indices
- saves storage costs
- new index may have fewer fields, coarser time buckets
- Index transforms
- create a different view of the data to analyze it in different ways
- groupings and aggregations
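A sketch of what an ISM policy document can look like (expressed as a Python dict): hot to UltraWarm after 7 days, delete after 90. The state names, index ages, and index pattern are assumptions for illustration:

```python
import json

policy = {
    "policy": {
        "description": "Tier and expire time-series indices",
        "default_state": "hot",
        "states": [
            {"name": "hot",
             "actions": [],
             "transitions": [{"state_name": "warm",
                              "conditions": {"min_index_age": "7d"}}]},
            {"name": "warm",
             "actions": [{"warm_migration": {}}],  # move to UltraWarm
             "transitions": [{"state_name": "delete",
                              "conditions": {"min_index_age": "90d"}}]},
            {"name": "delete",
             "actions": [{"delete": {}}],
             "transitions": []},
        ],
        "ism_template": {"index_patterns": ["logs-*"]},
    }
}
print(json.dumps(policy, indent=2))
```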
19
Q
OpenSearch Cross Cluster Replication
A
- replicate indices / mappings / metadata across domains
- replicate data geographically for better latency
- “follower” index pulls data from “leader” index
- With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
- “remote reindex” allows copying indices from one cluster to another on demand
20
Q
OpenSearch Stability
A
- 3 dedicated master nodes is best
- avoids “split brain”
- do not run out of disk space
- minimum storage requirement is roughly : source data * (1 + num of replicas) * 1.45
- Choosing the number of shards
- Choosing instance types
- at least 3 nodes
- mostly about storage requirements
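The minimum-storage rule of thumb above, worked as a small Python helper:

```python
def min_storage_gb(source_data_gb: float, replicas: int = 1) -> float:
    """Rule of thumb: source data * (1 + number of replicas) * 1.45.
    The 1.45 factor covers indexing overhead plus OS/service reserved space."""
    return source_data_gb * (1 + replicas) * 1.45

print(min_storage_gb(100))  # 100 GB of source data, 1 replica -> 290.0 GB
```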
21
Q
OpenSearch Security
A
- resource-based policies
- identity based policies
- VPC
- Cognito
22
Q
OpenSearch Anti Pattern
A
- OLTP
- ad-hoc data querying
- OpenSearch is primarily for search and analytics
23
Q
OpenSearch Performance
A
- memory pressure in the JVM can result from
- unbalanced shard allocations across nodes
- too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- delete old or unused indices
24
Q
Amazon Athena
A
- serverless
- interactive query service for s3 (SQL)
- Presto under the hood
- Supports many data formats
- csv, json, orc, parquet, avro
- unstructured, semi-structured or structured
25
Q
Amazon Athena Use Cases
A
- ad-hoc queries of web logs
- querying staging data before loading to redshift
- analyze cloudtrail / cloudfront / vpc logs in s3
- integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
26
Q
Athena Workgroups
A
- can organize users / teams / apps / workloads into WORKGROUPS
- can control query access and track costs by Workgroups
- Each workgroup has its own
- query history
- data limits
- iam policies
- encryption settings
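A hedged boto3 sketch of creating a workgroup with its own result location, encryption settings, and a per-query data limit; the workgroup name and bucket are hypothetical:

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",  # hypothetical
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results/analytics-team/",
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "BytesScannedCutoffPerQuery": 1024 ** 4,  # data limit: 1 TB per query
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
    },
)
```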
27
Q
Athena Cost Model
A
- Pay as you go
- $5 per TB scanned
- successful or cancelled queries count. Failed queries do not count
- No charge for DDL (CREATE/ALTER/DROP etc)
- Save lots of money by using columnar formats
- orc, parquet
- save 30-90% and get better performance
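The cost model, worked as simple arithmetic; the ~90% reduction for Parquet is an assumption within the 30-90% range quoted above:

```python
PRICE_PER_TB_SCANNED = 5.00  # USD

def query_cost(tb_scanned: float) -> float:
    """Cost of one successful or cancelled Athena query."""
    return tb_scanned * PRICE_PER_TB_SCANNED

print(query_cost(2.0))  # full scan of a 2 TB CSV table -> $10.00
print(query_cost(0.2))  # same data in Parquet, ~90% less scanned -> $1.00
```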
28
Q
Athena Security
A
- Transport Layer Security (TLS) encrypts data in transit between Athena and S3
29
Q
Athena Anti Pattern
A
- Highly formatted reports / visualization
- QuickSight better
- ETL
- use Glue instead
30
Q
Athena Optimized Performance
A
- Use columnar data (orc, parquet)
- a small number of large files performs better than a large number of small files
- Use partitions
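A sketch of these tips in practice: a partitioned, Parquet-backed table created via boto3; the table name, schema, and buckets are hypothetical:

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  request_time string,
  uri          string,
  status       int
)
PARTITIONED BY (year string, month string)
STORED AS PARQUET
LOCATION 's3://my-bucket/web_logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# A query with WHERE year = '2024' AND month = '01' now scans only
# the files under that partition's S3 prefix.
```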
31
Q
Athena ACID Transactions
A
- Powered by Apache Iceberg
- Just add 'table_type' = 'ICEBERG' in the CREATE TABLE statement
- concurrent users can safely make row-level modifications
- compatible with EMR, Spark, anything that supports the Iceberg format
- removes need for custom record locking
- time travel operations
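A minimal DDL sketch showing the 'table_type' = 'ICEBERG' property in context; the table schema and location are hypothetical:

```python
# Athena DDL for an Iceberg table.
iceberg_ddl = """
CREATE TABLE trades (
  trade_id bigint,
  symbol   string,
  price    double
)
LOCATION 's3://my-bucket/iceberg/trades/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""
# Once created, concurrent row-level UPDATE / DELETE / MERGE statements are
# safe, and time travel queries can read earlier snapshots of the table.
```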
32
Q
Amazon Redshift
A
- Fully managed, petabyte scale data warehouse
- Designed for OLAP not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built in replication and backups
- Monitoring via CloudWatch / CloudTrail
- Query exabytes of unstructured data in S3 without loading
- limitless concurrency
- Horizontal scaling
- Separate compute and storage resources
- Wide variety of data formats
- Supports Gzip and Snappy compression
33
Q
Redshift Use Cases
A
- Accelerate analytics workloads
- Unified data warehouse and data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions and clicks
- Aggregate gaming data
- Analyze social trends
34
Q
Redshift Performance
A
- Massively Parallel Processing
- Columnar Data Storage
- Column Compression
35
Q
Redshift Durability
A
- Replication within cluster
- Backup to S3 (asynchronously replicated to another region)
- Automated snapshots
- Failed drives / nodes automatically replaced
- However, limited to a single availability zone
36
Q
Redshift Scaling
A
- vertical and horizontal scaling on demand
- during scaling
- a new cluster is created while your old one remains available for reads
- CNAME is flipped to new cluster (a few mins of downtime)
- data moved in parallel to new compute nodes
- concurrency scaling
- automatically adds cluster capacity to handle increase in concurrent read queries
- supports virtually unlimited concurrent users and queries
37
Q
Redshift Distribution Styles
A
- AUTO (Redshift figures it out based on size of data)
- EVEN (rows distributed across slices in round-robin)
- KEY (rows distributed based on one column)
- ALL (entire table is copied to every node)
38
Q
Redshift Sort Key
A
- rows are stored on disk in sorted order based on the column you designate as a sort key
- like an index
- makes for fast range queries
- choosing a sort key
- single vs compound vs interleaved
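A DDL sketch combining the two cards above: distribute on a join key and sort on the range-filter column. The schema is hypothetical:

```python
# Hypothetical fact table: co-locate rows that join on customer_id,
# and keep rows sorted by sale_date for fast range queries.
ddl = """
CREATE TABLE sales (
  sale_id     bigint,
  customer_id bigint,
  sale_date   date,
  amount      decimal(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date)
"""
```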
39
Q
Redshift Importing / Exporting Data
A
- COPY command
- parallelized and efficient
- from s3, emr, DynamoDB, remote host
- S3 requires a manifest file and IAM role
- UNLOAD command
- unload from a table into files in S3
40
Q
Redshift COPY Command
A
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table,
- use INSERT INTO ... SELECT
- or CREATE TABLE AS
- COPY can decrypt data as it is loaded from S3
- hardware-accelerated SSL used to keep it fast
- gzip, lzop and bzip2 compression supported to speed it up further
- automatic compression option
- analyzes the data and figures out the optimal compression scheme for storing it
- Special Use Case : Narrow tables (lots of rows, few columns)
- load with a single COPY transaction if possible
- otherwise hidden metadata columns consume too much space
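A sketch of a COPY statement using the S3 + manifest + IAM role pattern with gzip compression; the bucket, manifest, role ARN, and table are hypothetical:

```python
copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/load-manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
GZIP
DELIMITER '|'
"""
```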
41
Q
Redshift DBLINK
A
- Connect Redshift to PostgreSQL
- Good way to copy and sync data between PostgreSQL and Redshift
42
Q
Redshift Workload Management
A
- Prioritize short, fast queries vs long, slow queries
- Creates up to 8 queues
- default 5 queues with even memory allocation
- configuring query queue
- priority
- concurrency scaling mode
- user groups
- query groups
- query monitoring rules
43
Q
Redshift Manual Workload Management
A
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
44
Q
Redshift Short Query Acceleration (SQA)
A
- Prioritize short-running queries over long running ones
- Short queries run in a dedicated space, won't wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- can configure how many seconds counts as “short”
45
Q
Redshift Resizing Clusters
A
- Elastic Resize
- quickly add or remove nodes of same type
- cluster is down for a few mins
- Classic Resize
- change node type or number of nodes
- cluster is read-only for hours to days
- Snapshot, restore, resize
- used to keep cluster available during a classic resize
46
Q
Redshift VACUUM
A
- recovers space from deleted rows
- VACUUM FULL
- Sorts the specified table and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
- VACUUM DELETE ONLY
- Reclaims disk space without sorting
- VACUUM SORT ONLY
- Sorts the specified table without reclaiming disk space
- VACUUM REINDEX
- Analyzes the distribution of values in sort key columns, then performs a full VACUUM
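The four variants from this card, as statements against a hypothetical table:

```python
vacuum_statements = [
    "VACUUM FULL sales;",         # sort + reclaim deleted space (the default)
    "VACUUM DELETE ONLY sales;",  # reclaim space, skip the sort
    "VACUUM SORT ONLY sales;",    # sort, skip space reclamation
    "VACUUM REINDEX sales;",      # re-analyze interleaved sort keys, then full vacuum
]
```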
47
Q
Redshift New Features
A
- RA3 nodes with managed storage
- enable independent scaling of compute and storage
- SSD-based
- Redshift Data Lake Export
- unload Redshift query results to S3 in Apache Parquet format
- parquet is 2x faster to unload and consumes up to 6x less storage
- spatial data types
48
Q
Redshift AQUA
A
- Advanced query accelerator
- pushes reduction and aggregation queries closer to the data
- up to 10x faster, no extra cost, no code changes
- benefits from high-bandwidth connection to s3
49
Q
Redshift Anti Pattern
A
- small data sets
- OLTP
- unstructured data
- BLOB data
50
Q
Redshift Security
A
- Using a Hardware Security Module (HSM)
- must use a client and server certificate to configure a trusted connection between Redshift and HSM
51
Q
Redshift Serverless
A
- Automatic scaling and provisioning for your workload
- Optimizes costs and performance
- Uses ML to maintain performance across variable and sporadic workloads
- Easy to spin up dev and test environments
- Easy ad-hoc business analysis
52
Q
Redshift Monitoring
A
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- CloudWatch logs
- CloudWatch metrics
53
Q
Amazon RDS
A
- Hosted relational database
- Aurora, MySQL, PostgreSQL, Oracle, etc
- Not for big data
54
Q
RDS ACID
A
- Atomicity
- Consistency
- Isolation
- Durability
55
Q
Amazon Aurora
A
- MySQL and PostgreSQL compatible
- up to 5x faster than MySQL, 3x faster than PostgreSQL
- 1/10 the cost of commercial databases
- Up to 64TB per database instance
- Up to 15 read replicas
- Continuous backup to s3
- Replication across availability zones
- Automatic scaling with Aurora Serverless
56
Q
Aurora Security
A
- VPC
- Encryption at rest : KMS
- Encryption in flight : SSL
57
Q
Amazon QuickSight
A
- Business analytics service
- allows all users to
- build visualizations
- perform ad-hoc analysis
- quickly get business insights from data
- serverless
58
Q
QuickSight SPICE
A
- Data sets are imported into SPICE
- super-fast, parallel, in-memory calculation engine
- uses columnar storage, in-memory processing, and machine code generation
- accelerates interactive queries on large data sets
- each user gets 10GB of SPICE
- highly available and durable
- scales to hundreds of thousands of users
59
Q
QuickSight Use Cases
A
- Interactive ad-hoc exploration / visualization of data
- dashboards and KPIs
- Analyze / visualize data from
- logs in s3
- on-premises databases
- AWS (RDS, Redshift, Athena, S3)
- SaaS applications such as Salesforce
60
Q
QuickSight Anti Pattern
A
- highly formatted canned reports
- ETL
61
Q
QuickSight Security
A
- VPC
- Multi-Factor Authentication
- Row-level security
- Column-level security (Enterprise edition only)
62
Q
QuickSight + Redshift Security
A
- By default QuickSight can only access data stored in the same region as the one QuickSight is running in
- Problem : QuickSight in region A, Redshift in region B
- Solution : create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
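A boto3 sketch of that solution; the security group ID is hypothetical, and the CIDR must be replaced with the published QuickSight IP range for the region where QuickSight runs:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,  # default Redshift port
        "ToPort": 5439,
        # Placeholder -- use the QuickSight IP range for QuickSight's region.
        "IpRanges": [{"CidrIp": "52.23.63.224/27",
                      "Description": "QuickSight access"}],
    }],
)
```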
63
Q
QuickSight User Management
A
- Users defined via IAM or email signup
- Active Directory connector with QuickSight Enterprise Edition
64
Q
QuickSight Pricing
A
- Annual Subscription
- Standard : $9 / month / user
- Enterprise : $18 / month / user
- Extra SPICE capacity
- $0.25 (Standard) / $0.38 (Enterprise) per GB / user / month
65
Q
QuickSight Dashboards
A
- read-only snapshots of an analysis
- can share with others with QuickSight access
- can share even more widely with embedded dashboards
- embed within an application
66
Q
QuickSight Machine Learning Insights
A
- ML powered anomaly detection
- ML powered forecasting
- Autonarratives