Domain 4: Analysis Flashcards by Vitor Sapucaia

RANDOM_CUT_FOREST

Kinesis Data Analytics SQL (or Flink) Function for anomaly detection in numeric columns

Kinesis Firehose Buffer Limits

Buffer size: 1 to 128 MB

Buffer interval: 60 to 900 seconds
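The two limits work together: a batch is delivered as soon as either the buffer size or the buffer interval is reached, whichever comes first. A minimal Python sketch of that rule, using the ranges from this card; `validate_buffer_hints` and `should_flush` are hypothetical helper names, not part of any AWS SDK.

```python
# Ranges taken from the card above.
MIN_SIZE_MB, MAX_SIZE_MB = 1, 128
MIN_INTERVAL_S, MAX_INTERVAL_S = 60, 900


def validate_buffer_hints(size_mb, interval_s):
    """Raise ValueError if the hints fall outside the ranges on the card."""
    if not MIN_SIZE_MB <= size_mb <= MAX_SIZE_MB:
        raise ValueError(f"buffer size must be {MIN_SIZE_MB}-{MAX_SIZE_MB} MB")
    if not MIN_INTERVAL_S <= interval_s <= MAX_INTERVAL_S:
        raise ValueError(f"buffer interval must be {MIN_INTERVAL_S}-{MAX_INTERVAL_S} s")


def should_flush(buffered_mb, elapsed_s, size_mb, interval_s):
    """A batch is delivered when either threshold is reached, whichever comes first."""
    validate_buffer_hints(size_mb, interval_s)
    return buffered_mb >= size_mb or elapsed_s >= interval_s
```

For example, with hints of 5 MB and 300 seconds, `should_flush(1, 300, 5, 300)` is true because the interval has elapsed even though the buffer is not full.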

Kinesis Data Analytics Supported Sources

Kinesis Streams and Kinesis Firehose

Kinesis Data Analytics Supported Destinations

Kinesis Streams, Kinesis Firehose, Lambda

What happens if a record arrives late to a Kinesis Data Analytics application

Record is written to the error stream

In what form does Kinesis Data Analytics provision capacity?

Kinesis Processing Units

How much memory is provided per KPU?

4GB

What is the default number of KPU per Kinesis Data Analytics application?

What is the name of the visualization tool in the Elastic Stack?

Kibana

Is ElasticSearch Serverless?

No, you still have to scale servers

What should ElasticSearch NOT be used for?

- OLTP (RDS or DynamoDB instead)

- Ad-hoc querying (Athena instead)

How can data be imported to ElasticSearch?

Kinesis, DynamoDB, Logstash, Beats, ElasticSearch API

What query engine does Athena use?

Presto

What data formats does Athena support?

CSV, JSON, Parquet, ORC, Avro

Is Athena serverless?

Yes

Does Athena support unstructured data?

Yes

Which data formats are columnar?

ORC and Parquet

Which data formats are splittable?

ORC, Parquet, Avro

Which notebooks can Athena integrate with?

Jupyter, Zeppelin, RStudio

What is the cost rate for Athena?

$5 per TB scanned
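As a worked example of that rate, here is a small Python sketch that turns bytes scanned into an estimated charge. It assumes a binary terabyte (1024^4 bytes) and ignores any per-query minimum or rounding the pricing page may apply; `athena_scan_cost` is a hypothetical helper name.

```python
PRICE_PER_TB_USD = 5.0     # rate from the card above
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte


def athena_scan_cost(bytes_scanned):
    """Estimated charge in USD for one query, given the bytes it scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, a query that scans 200 GB costs a little under one dollar: `athena_scan_cost(200 * 1024 ** 3)`.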

Do cancelled queries count toward Athena charges?

Yes

Do failed queries count toward Athena charges?

No

What data format will be the most cost effective in Athena?

Columnar (ORC, Parquet)
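Columnar formats are cheaper because a query only has to read the columns it references, while a row-oriented file is read in full. A rough Python sketch of the difference in bytes scanned; the table, its column names, and the per-column sizes are all hypothetical, and compression is ignored.

```python
GIB = 1024 ** 3

# Hypothetical table: on-disk bytes per column.
column_bytes = {
    "event_id": 100 * GIB,
    "user_id": 100 * GIB,
    "payload": 800 * GIB,
}


def bytes_scanned(columns_needed, columnar):
    """Row-oriented files are read in full; columnar files only for the needed columns."""
    if not columnar:
        return sum(column_bytes.values())
    return sum(column_bytes[c] for c in columns_needed)
```

With these made-up sizes, a query that only needs `user_id` scans 100 GiB from columnar storage versus 1000 GiB from row-oriented storage, so it is billed for a tenth of the data.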

Does Athena charge for DDL processing?

No

How can Athena results be encrypted?

Encrypt at rest in S3 using SSE-S3, SSE-KMS, CSE-KMS

Can Athena access S3 in another account?

Yes

How are Athena results encrypted in transit?

Transport Layer Security (TLS)

Is Redshift Serverless or Fully Managed?

Fully managed

What is the maximum number of compute nodes in a Redshift cluster?

128

What are the two types of compute nodes that can be selected for a Redshift cluster?

- Dense Storage (DS): uses HDDs for large capacity at low cost

- Dense Compute (DC): uses SSDs and lots of memory for faster performance at a higher cost

How many HDDs are on a ds2.xlarge Redshift compute node?

3, for a total of 2 TB of storage

How many HDDs are on a ds2.8xlarge Redshift compute node?

24, for a total of 16 TB of storage

How much SSD storage and RAM does a dc2.large Redshift compute node have?

160 GB SSD storage, 15 GB RAM

How much SSD storage and RAM does a dc2.8xlarge Redshift compute node have?

2.6 TB SSD storage, 244 GB RAM

What determines the number of Node Slices on a Compute Node?

The size of the Compute Node

What kind of data storage does Redshift use for high performance?

Columnar

Can you change the compression encoding for a column after a table is created in Redshift?

How many copies of your data are stored within Redshift?

Three - one main on cluster, one backup on cluster, one snapshot in S3

Can Redshift data be backed up to another region?

Yes - asynchronously in S3

How many AZs is Redshift limited to?

One

What is the default Redshift distribution style?

AUTO

What is the EVEN Redshift distribution style?

Steps through each slice and assigns data in round-robin fashion

What is the KEY Redshift distribution style?

Assigns data to each slice based on a selected key column. Ideal if you plan to query data on a specific column.

What is the ALL Redshift distribution style?

All data is replicated on every node in the cluster. Multiplies storage by the number of nodes in the cluster.
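The EVEN, KEY, and ALL styles described in the cards above can be sketched in a few lines of Python. This is a simplified model, not Redshift's implementation: `zlib.crc32` stands in for whatever hash Redshift uses internally, and the function names are hypothetical.

```python
import zlib
from itertools import cycle


def distribute_even(rows, n_slices):
    """EVEN: assign rows to slices in round-robin order."""
    slices = [[] for _ in range(n_slices)]
    for row, target in zip(rows, cycle(range(n_slices))):
        slices[target].append(row)
    return slices


def distribute_key(rows, n_slices, key):
    """KEY: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(n_slices)]
    for row in rows:
        # crc32 is a stand-in hash, not the one Redshift uses.
        target = zlib.crc32(str(row[key]).encode()) % n_slices
        slices[target].append(row)
    return slices


def distribute_all(rows, n_nodes):
    """ALL: every node holds a full copy, so storage grows with node count."""
    return [list(rows) for _ in range(n_nodes)]
```

KEY distribution keeps matching values together, which helps joins and filters on that column, but a skewed key can leave some slices much fuller than others.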

What are Redshift Sort Keys?

Similar to an index, makes for fast range queries

What are the three types of Redshift Sort Keys?

Single, Compound, Interleaved

What is the default type of Redshift Sort Key?

Compound

Does the order of Compound Sort Keys matter in Redshift?

Yes - first will be primary
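A compound sort key behaves like sorting by a tuple, so the first column dominates the ordering. The Python sketch below uses hypothetical `region` and `day` columns to show that a filter on the leading column hits one contiguous block of rows, while the same filter under the reversed key order does not.

```python
rows = [
    {"region": "us", "day": 3},
    {"region": "eu", "day": 1},
    {"region": "us", "day": 1},
    {"region": "eu", "day": 2},
]

# Compound sort key (region, day): region is the primary ordering.
by_region_day = sorted(rows, key=lambda r: (r["region"], r["day"]))

# Same columns, opposite order: day is now the primary ordering.
by_day_region = sorted(rows, key=lambda r: (r["day"], r["region"]))


def is_contiguous(sorted_rows, column, value):
    """True if all rows matching column == value sit next to each other."""
    hits = [i for i, r in enumerate(sorted_rows) if r[column] == value]
    return bool(hits) and hits == list(range(hits[0], hits[-1] + 1))
```

Here `is_contiguous(by_region_day, "region", "eu")` holds, but the same check on `by_day_region` fails, which is why the most frequently filtered column should usually come first.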

What is required when performing COPY from S3 to Redshift?

IAM role (authorization). A manifest file is optional and lists exactly which files to load.

What is the command to copy Redshift data into S3?

UNLOAD

How can you configure S3 to Redshift connections without going over public internet?

Enhanced VPC routing

Can COPY decrypt S3 data as it is loaded into Redshift?

Yes, using hardware accelerated SSL

If loading a tall but narrow table to Redshift, what should you attempt to do for efficiency?

Try to use only one COPY command (metadata is added for each COPY command)

How do you copy a Redshift snapshot to another region?

1. Create a KMS key in the destination region
2. Specify a unique name for your snapshot copy grant
3. Specify the KMS key for which you're creating the copy grant
4. In the source region, enable copying of snapshots to the copy grant you created
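The steps above can be sketched in Python against boto3-style clients passed in by the caller. The method names (`create_key`, `create_snapshot_copy_grant`, `enable_snapshot_copy`) and their parameters are the boto3 KMS and Redshift calls as I understand them, so check them against the current SDK documentation; the function name and all argument values are hypothetical.

```python
def enable_cross_region_snapshot_copy(kms_dest, redshift_dest, redshift_src,
                                      cluster_id, grant_name, dest_region):
    """Follow the four steps using boto3-style clients supplied by the caller.

    kms_dest and redshift_dest are clients for the destination region;
    redshift_src is a client for the source region.
    """
    # 1. Create a KMS key in the destination region.
    response = kms_dest.create_key(Description="Redshift snapshot copy")
    key_id = response["KeyMetadata"]["KeyId"]

    # 2 and 3. Create a uniquely named snapshot copy grant bound to that key.
    redshift_dest.create_snapshot_copy_grant(
        SnapshotCopyGrantName=grant_name,
        KmsKeyId=key_id,
    )

    # 4. In the source region, enable snapshot copy using the grant.
    redshift_src.enable_snapshot_copy(
        ClusterIdentifier=cluster_id,
        DestinationRegion=dest_region,
        SnapshotCopyGrantName=grant_name,
    )
    return key_id
```

Passing the clients in keeps the sketch free of credentials and region setup; in practice they would be created with something like `boto3.client("redshift", region_name=...)`.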

What is Redshift DBLINK?

Connects Redshift to PostgreSQL (which could be on RDS). Both must be in the same Availability Zone.

Can data be imported from DynamoDB to Redshift?

Yes

What is Redshift Workload Management (WLM)?

Prioritizes short, fast queries vs long, slow queries

How can you configure Redshift WLM?

Redshift Console, CLI, or API

What is Redshift Concurrency Scaling?

Automatically adds cluster capacity to handle increases in concurrent read queries

How do Redshift WLM and Concurrency Scaling interact?

WLM queues can manage which queries are sent to concurrency scaling clusters

How many queues can be created with Redshift Automatic WLM?

8 (default of 5)

Is concurrency raised or lowered on large queries in Automatic WLM?

Lowered

How many queues can be created with Redshift Manual WLM?

8 (default 1)

What is the default concurrency of the default queue in Redshift Manual WLM?

5

What is the maximum concurrency level in Redshift Manual WLM?

50

What is query queue hopping?

Timed out queries automatically hop to another queue and retry
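A toy Python model of that behavior, assuming each queue has a simple timeout and a timed-out query is retried from scratch in the next queue. It ignores which statement types are eligible to hop and how Redshift actually matches queries to queues; `run_with_hopping` is a hypothetical name.

```python
def run_with_hopping(query_runtime_s, queue_timeouts_s):
    """Return the index of the queue where the query finishes, or None.

    queue_timeouts_s lists each queue's timeout in seconds (None = no timeout),
    in the order the query would try them.
    """
    for index, timeout in enumerate(queue_timeouts_s):
        if timeout is None or query_runtime_s <= timeout:
            return index  # the query completes in this queue
        # Timed out here: hop to the next queue and retry.
    return None  # timed out in every queue
```

For example, a 30-second query tried against queues with timeouts of 10 seconds, 60 seconds, and no timeout finishes in the second queue (index 1).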

What is Redshift Short Query Acceleration (SQA)?

Prioritizes short queries. Alternative to WLM.

What statements does Redshift SQA support?

CREATE TABLE AS, and SELECT statements

How does Redshift SQA predict query execution time?

Machine Learning

What is the Redshift VACUUM command?

Recovers space from deleted rows and re-sorts rows

What are the four types of Redshift VACUUM commands?

FULL, DELETE ONLY, SORT ONLY, REINDEX (reanalyzes interleaved sort keys)

What is Elastic Resize in Redshift?

Quickly add or remove nodes of the same type. Low downtime. For some types, you can only double or halve the nodes.

What is Classic Resize in Redshift?

Change node type and/or number of nodes. Can lead to hours or days of read-only.

What is Redshift Snapshot, restore, resize?

Used to keep cluster available during a Classic resize. Minimizes downtime.

What are Redshift RA3 nodes?

Allow you to scale compute and storage capacity independently

What is Redshift Data Lake Export?

Unloads Redshift to S3 in Parquet format

What are some advantages of Parquet?

Up to 2x faster to unload and up to 6x smaller than text formats, automatically partitioned, compatible with many services (Spectrum, Athena, EMR, SageMaker)

What does ACID stand for?

Atomicity, Consistency, Isolation, Durability

What port does Kibana run on?

5601

Do you have to use Glue Data Catalogs when using Athena?

No, you can use standard Athena Data Catalogs

What language does Glue's ETL engine use?

Python or Scala

Can Athena invoke SageMaker models?

Yes

Is Athena's Data Catalog Hive metastore compatible?

Yes

When should you use Redshift?

Many different sources, highly structured, single source of truth, stored for long periods of time, performant on large sizes of data

When should you use Athena?

Don't want to worry about formatting or infrastructure, quick queries for troubleshooting, ad-hoc

When should you use EMR?

Need a wide variety of custom processing tasks, fine grained control over your clusters, custom code

What is Federated query in Athena?

Allows you to run SQL queries across a variety of relational, non-relational, and custom data sources. A unified way to run SQL queries across various data stores.

Can Athena read from compressed files?

Yes

Can Hive queries be run on Athena?

No (only Presto is supported)

What is SerDe?

Serializer/Deserializer; libraries that tell Hive how to interpret data formats; also used by Athena
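To make the idea concrete, here is a toy SerDe in Python that maps one CSV line to a row and back. It only illustrates the concept; real Hive SerDes are Java classes, and `CsvSerDe` here is a hypothetical class, not one of the libraries Athena ships with.

```python
import csv
import io


class CsvSerDe:
    """Toy SerDe: maps one CSV line to a row dict and back."""

    def __init__(self, columns):
        self.columns = columns

    def deserialize(self, line):
        """Stored text -> row (what a query engine does when reading)."""
        values = next(csv.reader([line]))
        return dict(zip(self.columns, values))

    def serialize(self, row):
        """Row -> stored text (what a query engine does when writing)."""
        buffer = io.StringIO()
        csv.writer(buffer).writerow([row[c] for c in self.columns])
        return buffer.getvalue().rstrip("\r\n")
```

The table definition supplies the column names, and the SerDe supplies the parsing rules, which is why the same files can be exposed as a table without being rewritten.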

What needs to be done to add data to a partitioned table in Athena?

ALTER TABLE ADD PARTITION
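A small Python sketch that builds such a statement as a string. The table name, partition column, and S3 location in the usage example are hypothetical, and the helper does no escaping, so it is for illustration only.

```python
def add_partition_ddl(table, partition, location):
    """Build an ALTER TABLE ADD PARTITION statement for Athena.

    partition maps partition column names to values, e.g. {"dt": "2021-01-01"}.
    """
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")
```

For example, `add_partition_ddl("logs", {"dt": "2021-01-01"}, "s3://example-bucket/logs/dt=2021-01-01/")` yields a statement that registers that one day's prefix as a new partition.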

Domain 4: Analysis Flashcards

(92 cards)