Analysis Flashcards
1
Q
Amazon Machine Learning
A
- Provides visualization tools and wizards to make creating a model easy
- Fully managed
- Outdated; superseded by SageMaker
2
Q
Amazon Machine Learning Cost Model
A
- Charged for compute time
3
Q
Amazon Machine Learning Promises
A
- No downtime
- Up to 100GB training data
- Up to 5 simultaneous jobs
4
Q
Amazon Machine Learning Anti Pattern
A
- Terabyte-scale data
- Unsupported learning tasks
- sequence prediction
- unsupervised clustering
- deep learning
5
Q
Amazon SageMaker
A
- Build, Train and Deploy models
- TensorFlow, Apache MXNet
- GPU accelerated deep learning
- Scaling effectively unlimited
- hyperparameter tuning jobs
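As a concrete illustration of the build/train workflow above, a minimal boto3 sketch that launches a SageMaker training job; the job name, role ARN, image URI, and buckets are all placeholders, not values from this deck:

```python
import boto3

sm = boto3.client("sagemaker")

# All names, ARNs, and URIs below are placeholders.
sm.create_training_job(
    TrainingJobName="demo-training-job",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/output/"},
    # GPU instance type for deep learning workloads
    ResourceConfig={"InstanceType": "ml.p3.2xlarge",
                    "InstanceCount": 1,
                    "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```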
6
Q
Amazon SageMaker Security
A
- Code stored in “ML storage volumes”
- All artifacts encrypted in transit and at rest
- API and console secured by SSL
- KMS integration for SageMaker notebook, training jobs, endpoints
7
Q
Deep Learning on EC2 / EMR
A
- EMR supports Apache MXNet and GPU instance types
- Appropriate instance types for deep learning
- P3 : 8 Tesla V100 GPUs
- P2 : 16 K80 GPUs
- G3 : 4 M60 GPUs
- Deep Learning AMIs
8
Q
AWS Data Pipeline
A
- Manages task dependencies
- Retries and notifies on failures
- Highly available
- Destinations : S3, RDS, DynamoDB, Redshift, EMR
9
Q
Kinesis Data Analytics
A
- Fully managed and serverless
- Transform and analyze streaming data in real time with Apache Flink
- Reference tables provide an inexpensive way to join streaming data for quick lookups
- Uses Flink under the hood
- Flink is a framework for processing data streams
- Kinesis Data Analytics integrates Flink with AWS
- Use Cases : Continuous metric generation, responsive real-time analytics, etc
- 1KPU = 1 vCPU and 4GB memory
10
Q
Kinesis Data Analytics + Lambda
A
- Post-processing
- aggregating rows, translating to different formats, transforming and enriching data
11
Q
Kinesis Data Analytics Use Cases
A
- Streaming ETL
- Continuous metric generation
- Responsive analysis
12
Q
RANDOM_CUT_FOREST
A
- SQL function used for anomaly detection on numeric columns in a stream
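A sketch of how RANDOM_CUT_FOREST is typically used: the documented SQL pattern embedded in a Python string and deployed through the legacy SQL-based Kinesis Data Analytics API. Stream, column, and application names are hypothetical, and the input/output stream mappings are omitted for brevity:

```python
import boto3

# RANDOM_CUT_FOREST emits an ANOMALY_SCORE for each record in the
# numeric column(s) it is given.
application_code = """
CREATE OR REPLACE STREAM "DEST_STREAM" ("value" DOUBLE, "ANOMALY_SCORE" DOUBLE);
CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
  INSERT INTO "DEST_STREAM"
  SELECT STREAM "value", "ANOMALY_SCORE"
  FROM TABLE(RANDOM_CUT_FOREST(
    CURSOR(SELECT STREAM "value" FROM "SOURCE_SQL_STREAM_001")
  ));
"""

kda = boto3.client("kinesisanalytics")
kda.create_application(
    ApplicationName="anomaly-detector",  # hypothetical
    ApplicationCode=application_code,
    # Inputs/Outputs (Kinesis stream mappings) omitted for brevity.
)
```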
13
Q
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
A
- A fork of Elasticsearch and Kibana
- A search engine
- Fully managed
- Scale up and down without downtime
14
Q
OpenSearch Use Cases
A
- Full text search
- Log analytics
- Application monitoring
- Security analytics
- Clickstream analytics
15
Q
OpenSearch Concepts
A
- Documents
- docs are hashed to a particular shard
- Indices
- An index has a primary shard and two replicas
- Applications should round-robin requests amongst nodes
- Write requests are routed to the primary shard, then replicated
- Read requests are routed to the primary shard or any replica
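A minimal sketch of these read/write paths against a domain's REST API using Python's requests library; the endpoint is a placeholder, and real domains also need an access policy and/or authenticated (e.g., SigV4-signed) requests:

```python
import requests

# Placeholder endpoint -- replace with your domain's endpoint.
endpoint = "https://my-domain.us-east-1.es.amazonaws.com"

# Index (write) a document: routed to a primary shard, then replicated.
doc = {"title": "hello", "views": 42}
resp = requests.put(f"{endpoint}/my-index/_doc/1", json=doc)
print(resp.status_code)

# Read it back: may be served by the primary shard or any replica.
resp = requests.get(f"{endpoint}/my-index/_doc/1")
print(resp.json()["_source"])
```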
16
Q
OpenSearch Options
A
- Dedicated master node(s)
- Choice of count and instance types
- Domains
- Zone Awareness
17
Q
OpenSearch Storage Tiers : Hot, UltraWarm, Cold
A
- Standard data nodes use “hot” storage
- instance stores or EBS volumes
- UltraWarm (“warm”) storage uses S3 + caching
- Cold storage
- Uses S3
- Requires a dedicated master node and UltraWarm enabled
- Data may be migrated between different storage types
18
Q
OpenSearch Index State Management
A
- Automates index management policies
- Example
- delete old indices after a period of time
- move indices from hot -> UltraWarm -> cold storage over time
- Automate index snapshots
- ISM policies are run every 30-48 minutes
- Index rollups
- periodically roll up old data into summarized indices
- saves storage costs
- new index may have fewer fields, coarser time buckets
- Index transforms
- create a different view of the data to analyze it in different ways
- groupings and aggregations
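A sketch of what an ISM policy document can look like (expressed as a Python dict): hot to UltraWarm after 7 days, delete after 90. The state names, index ages, and index pattern are assumptions for illustration:

```python
import json

policy = {
    "policy": {
        "description": "Tier and expire time-series indices",
        "default_state": "hot",
        "states": [
            {"name": "hot",
             "actions": [],
             "transitions": [{"state_name": "warm",
                              "conditions": {"min_index_age": "7d"}}]},
            {"name": "warm",
             "actions": [{"warm_migration": {}}],  # move to UltraWarm
             "transitions": [{"state_name": "delete",
                              "conditions": {"min_index_age": "90d"}}]},
            {"name": "delete",
             "actions": [{"delete": {}}],
             "transitions": []},
        ],
        "ism_template": {"index_patterns": ["logs-*"]},
    }
}
print(json.dumps(policy, indent=2))
```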
19
Q
OpenSearch Cross Cluster Replication
A
- replicate indices / mappings / metadata across domains
- replicate data geographically for better latency
- “follower” index pulls data from “leader” index
- With cross-cluster replication, we index data to a leader index and OpenSearch replicates that data to one or more read-only follower indices
- “remote reindex” allows copying indices from one cluster to another on demand
20
Q
OpenSearch Stability
A
- 3 dedicated master nodes is best
- avoids “split brain”
- do not run out of disk space
- minimum storage requirement is roughly : source data * (1 + num of replicas) * 1.45
- Choosing the number of shards
- Choosing instance types
- at least 3 nodes
- mostly about storage requirements
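The minimum-storage rule of thumb above, worked as a small Python helper:

```python
def min_storage_gb(source_data_gb: float, replicas: int = 1) -> float:
    """Rule of thumb: source data * (1 + number of replicas) * 1.45.
    The 1.45 factor covers indexing overhead plus OS/service reserved space."""
    return source_data_gb * (1 + replicas) * 1.45

print(min_storage_gb(100))  # 100 GB of source data, 1 replica -> 290.0 GB
```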
21
Q
OpenSearch Security
A
- resource-based policies
- identity based policies
- VPC
- Cognito
22
Q
OpenSearch Anti Pattern
A
- OLTP
- ad-hoc data querying
- OpenSearch is primarily for search and analytics
23
Q
OpenSearch Performance
A
- memory pressure in the JVM can result from
- unbalanced shard allocations across nodes
- too many shards in a cluster
- Fewer shards can yield better performance if JVMMemoryPressure errors are encountered
- delete old or unused indices
24
Q
Amazon Athena
A
- serverless
- interactive query service for s3 (SQL)
- Presto under the hood
- Supports many data formats
- csv, json, orc, parquet, avro
- unstructured, semi-structured or structured
25
Q
Amazon Athena Use Cases
A
- ad-hoc queries of web logs
- querying staging data before loading to redshift
- analyze cloudtrail / cloudfront / vpc logs in s3
- integration with Jupyter, Zeppelin, RStudio, QuickSight and other visualization tools
26
Q
Athena Workgroups
A
- can organize users / teams / apps / workloads into WORKGROUPS
- can control query access and track costs by Workgroups
- Each workgroup has its own
- query history
- data limits
- iam policies
- encryption settings
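A hedged boto3 sketch of creating a workgroup with its own result location, encryption settings, and a per-query data limit; the workgroup name and bucket are hypothetical:

```python
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="analytics-team",  # hypothetical
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results/analytics-team/",
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "BytesScannedCutoffPerQuery": 1024 ** 4,  # data limit: 1 TB per query
        "EnforceWorkGroupConfiguration": True,
        "PublishCloudWatchMetricsEnabled": True,
    },
)
```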
27
Q
Athena Cost Model
A
- Pay as you go
- $5 per TB scanned
- successful or cancelled queries count. Failed queries do not count
- No charge for DDL (CREATE/ALTER/DROP etc)
- Save lots of money by using columnar formats
- orc, parquet
- save 30-90% and get better performance
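The cost model, worked as simple arithmetic; the ~90% reduction for Parquet is an assumption within the 30-90% range quoted above:

```python
PRICE_PER_TB_SCANNED = 5.00  # USD

def query_cost(tb_scanned: float) -> float:
    """Cost of one successful or cancelled Athena query."""
    return tb_scanned * PRICE_PER_TB_SCANNED

print(query_cost(2.0))  # full scan of a 2 TB CSV table -> $10.00
print(query_cost(0.2))  # same data in Parquet, ~90% less scanned -> $1.00
```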
28
Q
Athena Security
A
- Transport Layer Security (TLS) encrypts data in transit between Athena and S3
29
Q
Athena Anti Pattern
A
- Highly formatted reports / visualization
- QuickSight better
- ETL
- use Glue instead
30
Q
Athena Optimized Performance
A
- Use columnar data (orc, parquet)
- a small number of large files performs better than a large number of small files
- Use partitions
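A sketch of these tips in practice: a partitioned, Parquet-backed table created via boto3; the table name, schema, and buckets are hypothetical:

```python
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  request_time string,
  uri          string,
  status       int
)
PARTITIONED BY (year string, month string)
STORED AS PARQUET
LOCATION 's3://my-bucket/web_logs/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# A query with WHERE year = '2024' AND month = '01' now scans only
# the files under that partition's S3 prefix.
```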
31
Q
Athena ACID Transactions
A
- Powered by Apache Iceberg
- Just add 'table_type' = 'ICEBERG' in the CREATE TABLE statement
- concurrent users can safely make row-level modifications
- compatible with EMR, Spark, anything that supports the Iceberg format
- removes need for custom record locking
- time travel operations
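A minimal DDL sketch showing the 'table_type' = 'ICEBERG' property in context; the table schema and location are hypothetical:

```python
# Athena DDL for an Iceberg table.
iceberg_ddl = """
CREATE TABLE trades (
  trade_id bigint,
  symbol   string,
  price    double
)
LOCATION 's3://my-bucket/iceberg/trades/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
"""
# Once created, concurrent row-level UPDATE / DELETE / MERGE statements are
# safe, and time travel queries can read earlier snapshots of the table.
```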
32
Q
Amazon Redshift
A
- Fully managed, petabyte scale data warehouse
- Designed for OLAP not OLTP
- Cost effective
- SQL, ODBC, JDBC interfaces
- Scale up or down on demand
- Built in replication and backups
- Monitoring via CloudWatch / CloudTrail
- Query exabytes of unstructured data in S3 without loading
- limitless concurrency
- Horizontal scaling
- Separate compute and storage resources
- Wide variety of data formats
- Supports Gzip and Snappy compression
33
Q
Redshift Use Cases
A
- Accelerate analytics workloads
- Unified data warehouse and data lake
- Data warehouse modernization
- Analyze global sales data
- Store historical stock trade data
- Analyze ad impressions and clicks
- Aggregate gaming data
- Analyze social trends
34
Q
Redshift Performance
A
- Massively Parallel Processing
- Columnar Data Storage
- Column Compression
35
Q
Redshift Durability
A
- Replication within cluster
- Backup to S3 (asynchronously replicated to another region)
- Automated snapshots
- Failed drives / nodes automatically replaced
- However, limited to a single availability zone
36
Q
Redshift Scaling
A
- vertical and horizontal scaling on demand
- during scaling
- a new cluster is created while your old one remains available for reads
- CNAME is flipped to new cluster (a few mins of downtime)
- data moved in parallel to new compute nodes
- concurrency scaling
- automatically adds cluster capacity to handle increase in concurrent read queries
- supports virtually unlimited concurrent users and queries
37
Q
Redshift Distribution Styles
A
- AUTO (Redshift figures it out based on size of data)
- EVEN (rows distributed across slices in round-robin)
- KEY (rows distributed based on one column)
- ALL (entire table is copied to every node)
38
Q
Redshift Sort Key
A
- rows are stored on disk in sorted order based on the column you designate as a sort key
- like an index
- makes for fast range queries
- choosing a sort key
- single vs compound vs interleaved
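A DDL sketch combining the two cards above: distribute on a join key and sort on the range-filter column. The schema is hypothetical:

```python
# Hypothetical fact table: co-locate rows that join on customer_id,
# and keep rows sorted by sale_date for fast range queries.
ddl = """
CREATE TABLE sales (
  sale_id     bigint,
  customer_id bigint,
  sale_date   date,
  amount      decimal(10,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (sale_date)
"""
```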
39
Q
Redshift Importing / Exporting Data
A
- COPY command
- parallelized and efficient
- from s3, emr, DynamoDB, remote host
- S3 requires a manifest file and IAM role
- UNLOAD command
- unload from a table into files in S3
40
Q
Redshift COPY Command
A
- Use COPY to load large amounts of data from outside of Redshift
- If your data is already in Redshift in another table,
- use INSERT INTO ... SELECT
- or CREATE TABLE AS
- COPY can decrypt data as it is loaded from S3
- hardware-accelerated SSL used to keep it fast
- gzip, lzop and bzip2 compression supported to speed it up further
- automatic compression option
- analyzes the data and figures out the optimal compression scheme for storing it
- Special Use Case : Narrow tables (lots of rows, few columns)
- load with a single COPY transaction if possible
- otherwise hidden metadata columns consume too much space
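A sketch of a COPY statement using the S3 + manifest + IAM role pattern with gzip compression; the bucket, manifest, role ARN, and table are hypothetical:

```python
copy_sql = """
COPY sales
FROM 's3://my-bucket/sales/load-manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
MANIFEST
GZIP
DELIMITER '|'
"""
```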
41
Q
Redshift DBLINK
A
- Connect Redshift to PostgreSQL
- Good way to copy and sync data between PostgreSQL and Redshift
42
Q
Redshift Workload Management
A
- Prioritize short, fast queries vs long, slow queries
- Creates up to 8 queues
- default 5 queues with even memory allocation
- configuring query queue
- priority
- concurrency scaling mode
- user groups
- query groups
- query monitoring rules
43
Q
Redshift Manual Workload Management
A
- One default queue with concurrency level of 5 (5 queries at once)
- Superuser queue with concurrency level 1
- Define up to 8 queues, up to concurrency level 50
44
Q
Redshift Short Query Acceleration (SQA)
A
- Prioritize short-running queries over long running ones
- Short queries run in a dedicated space, won't wait in queue behind long queries
- Can be used in place of WLM queues for short queries
- can configure how many seconds counts as “short”
45
Q
Redshift Resizing Clusters
A
- Elastic Resize
- quickly add or remove nodes of same type
- cluster is down for a few mins
- Classic Resize
- change node type or number of nodes
- cluster is read-only for hours to days
- Snapshot, restore, resize
- used to keep cluster available during a classic resize
46
Q
Redshift VACUUM
A
- recovers space from deleted rows
- VACUUM FULL
- Sorts the specified table and reclaims disk space occupied by rows that were marked for deletion by previous UPDATE and DELETE operations
- VACUUM DELETE ONLY
- Reclaims disk space without sorting
- VACUUM SORT ONLY
- Sorts the specified table without reclaiming disk space
- VACUUM REINDEX
- Analyzes the distribution of values in sort key columns, then performs a full VACUUM
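The four variants from this card, as statements against a hypothetical table:

```python
vacuum_statements = [
    "VACUUM FULL sales;",         # sort + reclaim deleted space (the default)
    "VACUUM DELETE ONLY sales;",  # reclaim space, skip the sort
    "VACUUM SORT ONLY sales;",    # sort, skip space reclamation
    "VACUUM REINDEX sales;",      # re-analyze interleaved sort keys, then full vacuum
]
```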
47
Q
Redshift New Features
A
- RA3 nodes with managed storage
- enable independent scaling of compute and storage
- SSD-based
- Redshift Data Lake Export
- unload Redshift query results to S3 in Apache Parquet format
- parquet is 2x faster to unload and consumes up to 6x less storage
- spatial data types
48
Q
Redshift AQUA
A
- Advanced query accelerator
- pushes reduction and aggregation queries closer to the data
- up to 10x faster, no extra cost, no code changes
- benefits from high-bandwidth connection to s3
49
Q
Redshift Anti Pattern
A
- small data sets
- OLTP
- unstructured data
- BLOB data
50
Q
Redshift Security
A
- Using a Hardware Security Module (HSM)
- must use a client and server certificate to configure a trusted connection between Redshift and HSM
51
Q
Redshift Serverless
A
- Automatic scaling and provisioning for your workload
- Optimizes costs and performance
- Uses ML to maintain performance across variable and sporadic workloads
- Easy to spin up dev and test environments
- Easy ad-hoc business analysis
52
Q
Redshift Monitoring
A
- Monitoring views
- SYS_QUERY_HISTORY
- SYS_LOAD_HISTORY
- SYS_SERVERLESS_USAGE
- CloudWatch logs
- CloudWatch metrics
53
Q
Amazon RDS
A
- Hosted relational database
- Aurora, MySQL, PostgreSQL, Oracle, etc
- Not for big data
54
Q
RDS ACID
A
- Atomicity
- Consistency
- Isolation
- Durability
55
Q
Amazon Aurora
A
- MySQL and PostgreSQL compatible
- up to 5x faster than MySQL, 3x faster than PostgreSQL
- 1/10 the cost of commercial databases
- Up to 64TB per database instance
- Up to 15 read replicas
- Continuous backup to s3
- Replication across availability zones
- Automatic scaling with Aurora Serverless
56
Q
Aurora Security
A
- VPC
- Encryption at rest : KMS
- Encryption in flight : SSL
57
Q
Amazon QuickSight
A
- Business analytics service
- allows all users to
- build visualizations
- perform ad-hoc analysis
- quickly get business insights from data
- serverless
58
Q
QuickSight SPICE
A
- Data sets are imported into SPICE
- super-fast, parallel, in-memory calculation engine
- uses columnar storage, in-memory processing, and machine code generation
- accelerates interactive queries on large data sets
- each user gets 10GB of SPICE
- highly available and durable
- scales to hundreds of thousands of users
59
Q
QuickSight Use Cases
A
- Interactive ad-hoc exploration / visualization of data
- dashboards and KPIs
- Analyze / visualize data from
- logs in s3
- on-premises databases
- AWS (RDS, Redshift, Athena, S3)
- SaaS applications such as Salesforce
60
Q
QuickSight Anti Pattern
A
- highly formatted canned reports
- ETL
61
Q
QuickSight Security
A
- VPC
- Multi-Factor Authentication
- Row-level security
- Column-level security (Enterprise edition only)
62
Q
QuickSight + Redshift Security
A
- By default QuickSight can only access data stored in the same region as the one QuickSight is running in
- Problem : QuickSight in region A, Redshift in region B
- Solution : create a new security group with an inbound rule authorizing access from the IP range of QuickSight servers in that region
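A boto3 sketch of that solution; the security group ID is hypothetical, and the CIDR must be replaced with the published QuickSight IP range for the region where QuickSight runs:

```python
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,  # default Redshift port
        "ToPort": 5439,
        # Placeholder -- use the QuickSight IP range for QuickSight's region.
        "IpRanges": [{"CidrIp": "52.23.63.224/27",
                      "Description": "QuickSight access"}],
    }],
)
```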
63
Q
QuickSight User Management
A
- Users defined via IAM or email signup
- Active Directory connector with QuickSight Enterprise Edition
64
Q
QuickSight Pricing
A
- Annual Subscription
- Standard : $9 / month / user
- Enterprise : $18 / month / user
- Extra SPICE capacity
- $0.25 (Standard) / $0.38 (Enterprise) per GB / user / month
65
Q
QuickSight Dashboards
A
- read-only snapshots of an analysis
- can share with others with QuickSight access
- can share even more widely with embedded dashboards
- embed within an application
66
Q
QuickSight Machine Learning Insights
A
- ML powered anomaly detection
- ML powered forecasting
- Autonarratives