Analytics Flashcards

1
What does Glue create when it scans your unstructured data in S3?
It creates metadata which can be used to query the data.
2
What is Hive?
It allows you to run SQL-like queries on EMR.
3
Can you import a Hive metastore into AWS Glue?
Yes. You can also import AWS Glue metadata into Hive.
4
How can you increase the performance of a Spark job?
Provision additional DPUs (Data Processing Units).
5
How can you determine how many DPUs you will need for your job?
Enable job metrics to understand the maximum capacity in DPUs that you will need.
6
Where are Glue errors reported?
CloudWatch.
7
How can you schedule Glue jobs?
The Glue Scheduler. This is the most straightforward approach.
8
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords.
9
What is a DynamicRecord in AWS Glue?
A self-describing record that carries its own schema.
10
Using native AWS Glue functionality, how can you drop fields or null fields?
The DropFields or DropNullFields transformations.
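The semantics of these transformations can be sketched in plain Python on dict records. This is an illustration only, not the actual Glue DynamicFrame API; the function names mirror the transformations but are hypothetical helpers:

```python
# Plain-Python sketch of what Glue's DropFields / DropNullFields
# transformations do conceptually -- NOT the Glue API.

def drop_fields(record, fields):
    """Remove the named fields from a record (like DropFields)."""
    return {k: v for k, v in record.items() if k not in fields}

def drop_null_fields(record):
    """Remove fields whose value is None (like DropNullFields)."""
    return {k: v for k, v in record.items() if v is not None}

row = {"id": 1, "name": "ada", "ssn": "123-45-6789", "nickname": None}
print(drop_fields(row, {"ssn"}))   # removes 'ssn'
print(drop_null_fields(row))       # removes 'nickname'
```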
11
Using AWS Glue, how can you select a subset of records during your ETL process?
Use the Filter transformation.
12
How can you enrich your data from another source in AWS Glue?
Use the Join transformation.
13
What does the Map transformation in AWS Glue do?
It allows you to add fields, delete fields, and perform external lookups.
14
What does the FindMatches ML transformation do in AWS Glue?
It identifies duplicate or matching records in your dataset, even when the records do not have a common identifier.
15
What format conversions can AWS Glue support?
CSV, JSON, Avro, Parquet, ORC, XML
16
What does AWS Glue ResolveChoice do?
It resolves ambiguities in your DynamicFrame and returns a new one; an example is two fields with the same name.
17
How do you update your Glue Data Catalog?
Re-run the crawler, or have a script use enableUpdateCatalog / updateBehavior.
18
What are AWS Glue development endpoints?
They allow you to use a notebook to develop your ETL script. They are launched in a VPC and can be used with SageMaker notebooks or Apache Zeppelin.
19
What do AWS Glue job bookmarks do?
They keep track of where you left off so you do not reprocess old data. They work with S3 sources and relational databases, but only pick up new rows in a database, not updated ones, and the primary key must be sequential.
20
Can you start a Step Functions state machine from an AWS Glue event?
Yes.
21
How are you billed using AWS Glue?
You are billed by the second.
22
How are AWS Glue development endpoints billed?
By the minute.
23
If you want to use engines like Hive or Pig, what AWS service is the best fit?
EMR; Glue is based on Spark.
24
Can AWS Glue process streaming data?
Yes. It can do this from Kinesis or Kafka.
25
Can AWS Glue clean and transform streaming data in-flight?
Yes.
26
What is AWS Glue Studio?
It is a visual interface for ETL workflows
27
Where can you view the status of AWS Glue Jobs running?
In the Glue Studio Monitoring console
28
What is AWS Glue Data Quality?
It evaluates your data against rules that you set, using DQDL (Data Quality Definition Language) for custom rules.
29
What is AWS Glue DataBrew?
A visual data preparation tool for transforming data.
30
What are Glue DataBrew sources?
S3, data warehouses, and databases.
31
Where does Glue DataBrew output data?
S3
32
What is a recipe in DataBrew?
It is a saved set of transformations that can be applied to any dataset
33
Can you define data quality rules in DataBrew?
Yes
34
How do Redshift and Snowflake integrate with DataBrew?
You can use custom SQL to create datasets
35
Does DataBrew integrate with KMS?
Yes, but only with KMS customer master keys (CMKs).
36
Can you schedule a job in Data Brew?
Yes
37
How can you remove PII in DataBrew?
Substitution, shuffling, deterministic encryption, probabilistic encryption, NULL or DELETE, mask out, and hash.
38
What are Amazon EventBridge batch conditions?
An event fires only when a specified number of events, or a number of seconds within a time period, is exceeded.
39
What is AWS Lake Formation?
It makes it easy to set up a secure data lake in days.
40
What can you do in Lake Formation?
Anything that you can do in Glue. It is built on Glue.
41
What AWS services can query Lake Formation?
Athena, Redshift, and EMR
42
Can you have multiple accounts accessing Lake Formation?
Yes. The recipient must be a data lake administrator. You can leverage AWS RAM for this as well.
43
Does Lake Formation support manifests?
No
44
What are AWS Lake Formation Governed Tables?
They support ACID transactions across multiple tables. This cannot be changed once enabled. Also works with Kinesis streaming data.
45
How does Lake Formation optimize storage performance?
Automatic compaction
46
How can you control access to Lake Formation data?
Granular row and column level access
47
Other than IAM, what else can Lake Formation tie into for permissions?
SAML or external AWS accounts
48
What are Lake formation policy tags attached to?
Databases, tables, or columns
49
What are AWS Lake Formation Filters?
They provide column-, row-, or cell-level security. This is done when granting SELECT permissions on tables.
50
What is AWS Athena?
A query service for your data in S3. All the data stays in S3.
51
Is Athena serverless?
Yes.
52
What data formats are splittable for parallel processing in Athena?
ORC, Parquet, and Avro
53
What are Athena Workgroups?
They organize users, teams, and applications into groups. You can control access and track costs by workgroup. They integrate with IAM, CloudWatch, and SNS.
54
Can you set query limits in Athena by using workgroups?
Yes. You can limit how much data is returned.
55
Are Athena canceled queries billable?
Yes. Only failed queries are not billable.
56
Are CREATE / ALTER / DROP queries billable in Athena?
No
57
What can you do to save money querying data with Athena?
Use a columnar format such as ORC or Parquet. You will scan less data.
58
Do large files perform better in Athena?
Yes. A small number of large files performs better than a large number of small files.
59
What should you run when you partition after the fact in Athena?
Run MSCK REPAIR TABLE
60
If you want to ensure your table is ACID compliant in Athena, what table type should you use?
ICEBERG
61
What are Athena time travel operations?
You can recover data recently deleted with a SELECT statement
62
What should you do if your ACID transactions in Athena are getting slower over time
Run the OPTIMIZE table command with REWRITE DATA USING BIN_PACK to compact the table's files.
63
How granular does Athena get with permissions?
Database and table level.
64
What can you use to query Spark directly?
Spark SQL
65
Does Spark have machine learning capabilities?
Yes, using MLlib.
66
Can you process streaming data with Spark?
Yes. It integrates with Kinesis and Kafka
67
Can you change data formats in Athena?
Yes, using CTAS and the format attribute.
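A CTAS format conversion is just SQL text. Below is a hypothetical Python helper that assembles such a statement; the table names and S3 location are made-up examples, while the WITH (format = ..., external_location = ...) clause follows Athena's CTAS syntax:

```python
# Sketch: build an Athena CTAS statement that rewrites a table
# into a different (columnar) format. Names here are examples.

def build_ctas(new_table, source_table, fmt, location):
    """Return a CTAS statement converting source_table to fmt."""
    return (
        f"CREATE TABLE {new_table} "
        f"WITH (format = '{fmt}', external_location = '{location}') "
        f"AS SELECT * FROM {source_table}"
    )

sql = build_ctas("sales_parquet", "sales_csv", "PARQUET",
                 "s3://my-bucket/sales_parquet/")
print(sql)
```

You would then submit the generated string to Athena (for example via the console or an API client); the conversion itself happens server-side.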
68
What is Spark Structured Streaming?
It keeps appending incoming data to an unbounded table, and you query it using windows of time.
69
Can Spark support Redshift?
Yes
70
Can you run a Jupyter notebook with Spark within the Athena console?
Yes
71
What is AWS EMR?
A managed Hadoop framework that runs on EC2. It includes Spark, HBase, Presto, Flink, Hive, and more.
72
What are EMR Notebooks?
Browser-based development in a notebook.
73
What are the EMR node types?
Master, core, and task nodes.
74
Where does data persist in EMR?
On the core nodes in HDFS
75
Do EMR task nodes store data?
No
76
What is a good strategy to reduce EMR costs?
Use spot instances for task nodes since they do not persist data.
77
What is a transient EMR cluster?
One that terminates once all the steps are complete. Good for cost savings.
78
What is a long running cluster?
One that must be manually terminated. A good use of reserved instances for cost savings.
79
When do you configure frameworks and applications in EMR?
When the cluster is launched.
80
How can you run EMR jobs directly?
By connecting to the master node or submitting jobs via ordered steps in the console.
81
When using EMR, where can you store data for it to persist?
S3
82
How can you schedule the start of your EMR cluster?
The AWS Data Pipeline
83
Is HDFS persistent?
No.
84
What is the default block size in HDFS?
128MB
85
What is EMRFS?
It allows you to access S3 as if it were HDFS.
86
What is EMRFS Consistent View?
It uses DynamoDB to track consistency. S3 has been strongly consistent since late 2020, so this is largely unnecessary now.
87
Can you use EBS for HDFS?
Yes, though it will be ephemeral. EBS volumes can only be attached when launching the cluster.
88
How is EMR billed?
By the hour
89
How do you increase processing capacity in EMR?
You can add task nodes on the fly as long as you do not also need to increase storage capacity.
90
How do you increase processing and storage capacity in EMR?
Resize the cluster core nodes.
91
What is EMR Managed Scaling?
Adds core nodes and then task nodes up to the max units specified. It also scales down to your configured value.
92
When scaling down, which EMR nodes get removed first?
Spot nodes (task and then core)
93
Can you specify the resources needed for your job in EMR Serverless?
Yes. Without configuring this, EMR will calculate the value on its own.
94
Is EMR multi-region?
No
95
How does the EMR Serverless application lifecycle move from state to state?
Via API calls. This is not automatic.
96
Can EMR run on EKS?
Yes. It can run alongside other applications.
97
What is a record made up of in Kinesis Data Streams when it is sent from the producer?
A partition key and a data blob (up to 1 MB).
98
How fast is a shard in Kinesis Data Streams when being sent from the producer to the stream?
1 MB per second or 1,000 messages per second per shard.
99
What is a record made up of in Kinesis Data Streams when it is sent to the consumer?
A partition key, a sequence number, and a data blob (up to 1 MB).
100
How fast is a shard in Kinesis Data Streams when being sent from the stream to the consumer?
Shared mode: 2 MB per second per shard across all consumers. Enhanced fan-out: 2 MB per second per shard per consumer.
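The per-shard limits in cards 98 and 100 imply a simple sizing calculation. A minimal Python sketch (the 1 MB/s and 1,000 records/s figures come from the cards; `shards_needed` is a hypothetical helper, not an AWS API):

```python
import math

# Minimum shard count to absorb a producer load, given the
# per-shard write limits: 1 MB/s or 1,000 records/s.

def shards_needed(write_mb_per_s, records_per_s):
    """Take the larger of the two constraints."""
    return max(math.ceil(write_mb_per_s / 1.0),
               math.ceil(records_per_s / 1000.0))

print(shards_needed(5.0, 2500))  # 5: throughput-bound
print(shards_needed(0.5, 3000))  # 3: record-rate-bound
```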
101
What is the maximum retention for a Kinesis Data Stream?
365 days. Retention is configurable between 1 and 365 days.
102
Can you replay data in Kinesis?
Yes.
103
What is the provisioned capacity mode in Kinesis Data Streams?
You choose the number of shards and scale manually.
104
What is the on-demand capacity mode in Kinesis Data Streams?
It scales automatically based on observed throughput. The default capacity is 4 MB/s or 4,000 records per second.
105
How can you increase throughput using the Kinesis Producer SDK?
Use PutRecords for batching.
106
What is the best use case for the Kinesis Producer SDK?
Low-throughput use cases that can tolerate higher latency.
107
What managed AWS sources send to Kinesis Data Streams?
CloudWatch, AWS IoT, and Kinesis Data Analytics
108
What APIs are included in the Kinesis Producer Library?
Synchronous and asynchronous
109
Does Kinesis Producer Library support record compression?
No
110
How do you add delay in Kinesis Producer Library batching?
RecordMaxBufferedTime
111
Can Apache Spark consume Kinesis Data Streams?
Yes
112
What is the maximum amount of data returned by the Kinesis SDK GetRecords function?
10 MB of data or up to 10,000 records.
113
What is the Maximum GetRecords API calls per shard per second?
5
114
What is checkpointing in the Kinesis Client Library?
It marks your progress.
115
When you are checkpointing using the KCL and you receive the ExpiredIteratorException, what does this mean?
You need to increase the WCUs (write capacity units) of the DynamoDB checkpoint table.
116
What is the Kinesis Connector Library?
It sends data to S3, DynamoDB, Redshift, OpenSearch, etc. It runs on an EC2 instance and is largely deprecated.
117
Why is Kinesis Enhanced Fanout fast?
It uses HTTP/2 to push to consumers.
118
What is the latency when Kinesis Enhanced Fanout is enabled?
Less than 70ms
119
When should you use Kinesis Standard Consumers?
When there is a low number of consumers, you can tolerate ~200 ms latency, and you want to minimize cost.
120
When should you use Kinesis Enhanced Fan-Out Consumers?
When you have multiple consumer applications for the same stream and need low latency, at a higher cost.
121
What is the default limit of consumers per data stream when using enhanced fan-out in Kinesis?
20, but you can submit a service request to increase it.
122
What happens when you split a hot shard in Kinesis?
Two new shards are created; the old shard goes away when its data expires.
123
What happens when you merge shards in Kinesis?
One shard is created; the old shards go away when their data expires.
124
What can cause out-of-order records in Kinesis?
Resharding. Make sure you read entirely from the parent shard before reading from the child shards; this logic is built into the KCL.
125
Can Kinesis Resharding be done in Parallel?
No
126
How many resharding operations can be performed at once?
One. This can be a problem when you have thousands of shards.
127
What can cause duplicates from your Kinesis producer?
Network timeouts. Use unique record IDs to deduplicate on the consumer side.
128
What use cases can cause a consumer duplicate in Kinesis?
A worker terminates unexpectedly; a worker instance is added or removed; shards are merged or split; or the application is deployed.
129
What can you do to fix duplicate consumer records in Kinesis?
Make your application idempotent, or handle duplicates at the final destination.
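The idempotency fix can be sketched in a few lines: deduplicate on a unique record ID at the consumer, so redelivery after a worker restart or reshard is harmless. The record shape and `process_once` helper are illustrative assumptions, not a Kinesis API:

```python
# Consumer-side deduplication: track IDs already processed so
# that redelivered records are skipped (idempotent processing).

def process_once(records, seen_ids, sink):
    for rec in records:
        if rec["id"] in seen_ids:
            continue              # duplicate delivery -- skip
        seen_ids.add(rec["id"])
        sink.append(rec["value"])

seen, out = set(), []
batch = [{"id": "r1", "value": 10}, {"id": "r2", "value": 20}]
process_once(batch, seen, out)
process_once(batch, seen, out)    # redelivery changes nothing
print(out)                        # [10, 20]
```

In production the seen-ID set would live in a durable store (or you would rely on an idempotent write at the final destination), but the principle is the same.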
130
When using a Kinesis Data Stream, how can you transform the data before storing it in S3?
With a Lambda in Kinesis Data Firehose
131
Can Kinesis Firehose write to Redshift?
Yes. It loads the data to S3 first and then issues a COPY command.
132
Can Kinesis Data Firehose write to OpenSearch?
Yes
133
Can Firehose deliver to custom locations?
Yes, as long as there is an HTTP endpoint.
134
Can you store data sent into Kinesis Firehose?
Yes. All records, or only failed records, can be backed up to an S3 bucket.
135
What is the minimum latency for Firehose?
60 seconds
136
Can Firehose perform data conversions?
Limited, but yes: JSON to Parquet or ORC, but only for S3 destinations. Other conversions are done using Lambda.
137
Can Firehose compress your data before sending it to S3?
Yes, using GZIP, ZIP, or Snappy.
138
Can Spark or Kinesis Client Library read from Data Firehose?
No
139
What determines when records are sent in Kinesis Data Firehose?
The buffer size and buffer time. Whichever limit is hit first.
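The "whichever limit is hit first" rule can be sketched as a simple predicate. The 5 MB / 60 s numbers below are illustrative placeholders, not Firehose defaults:

```python
# Firehose-style flush rule: deliver the batch when EITHER the
# size limit OR the time limit is reached, whichever comes first.
# The limits here are illustrative, not service defaults.

def should_flush(buffered_bytes, seconds_waited,
                 size_limit=5 * 1024 * 1024, time_limit=60):
    return buffered_bytes >= size_limit or seconds_waited >= time_limit

print(should_flush(6 * 1024 * 1024, 10))  # True: size hit first
print(should_flush(1024, 61))             # True: time hit first
print(should_flush(1024, 10))             # False: keep buffering
```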
140
What are the minimum values for Firehose buffer size and time?
Buffer size: a few MB. Buffer time: 1 minute.
141
If you need real-time data made searchable using Kinesis, what would you use?
Kinesis Data Streams with a Lambda to send the data to OpenSearch.
142
What is a CloudWatch subscription filter?
A subscription filter lets you send CloudWatch Logs data to other AWS services like Lambda, Kinesis Data Streams, etc.
143
Can Kinesis Data Analytics send to Lambda?
Yes. This can be used to encrypt, translate to another format, aggregate rows, etc..
144
What can Kinesis Data Analytics integrate with that Firehose cannot?
DynamoDB, Aurora, SNS, SQS, and CloudWatch.
145
What is Kinesis Data Analytics now called?
Managed Service for Apache Flink
146
What can you use in Managed Service for Apache Flink to access SQL?
Table API
147
What are some good use cases for Managed Service for Apache Flink?
Streaming ETL, continuous metric generation, and responsive analytics.
148
What is Kinesis Analytics Schema Discovery?
It infers the schema in real time.
149
What is Kinesis Data Analytics RANDOM_CUT_FOREST?
It detects anomalies in your data.
150
What is AWS MSK?
Managed Streaming for Apache Kafka. An alternative to Kinesis.
151
What is the maximum message size for AWS MSK?
10 MB. This is much larger than Kinesis Data Streams at 1 MB.
152
Can you persist data in AWS MSK?
Yes. It uses EBS volumes and is more flexible than Kinesis Data Streams.
153
Can you control who writes to a topic in AWS MSK?
Yes. This can be done using mutual TLS with Kafka ACLs, IAM access control, or SASL/SCRAM with Kafka ACLs.
154
What is AWS MSK Connect?
It allows you to connect to other AWS services for delivery such as S3, Redshift, Opensearch, etc..
155
Can AWS MSK be Serverless?
Yes.
156
What is AWS OpenSearch?
Formerly known as Elasticsearch. Petabyte-scale analysis and reporting.
157
What are good use cases for OpenSearch?
Full-text search, log analytics, application monitoring, and security analytics.
158
What are Types in OpenSearch?
They define the schema and mapping shared by documents.
159
What are Indices in OpenSearch?
They contain inverted indices that let you search across everything within them at once.
160
What is the structure of an index in OpenSearch?
They are split into shards and documents are hashed to a particular shard. Shards can be on different nodes in a cluster.
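Hash-to-shard routing can be illustrated in a few lines. Real OpenSearch uses its own routing formula; this hypothetical `route` function only shows the idea that the same document ID always lands on the same shard:

```python
import hashlib

# Illustration of hash-based document routing: hash the document
# ID and take it modulo the shard count. Deterministic, so the
# same ID always maps to the same shard.

def route(doc_id, num_shards):
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

shard = route("user-42", 5)
assert route("user-42", 5) == shard   # deterministic
print(0 <= shard < 5)                 # True
```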
161
Can you offload reads in OpenSearch?
Yes, using replicas.
162
What is a domain in OpenSearch?
It is essentially the cluster.
163
How can you back your data up in OpenSearch?
Snapshot to S3
164
Does OpenSearch support resource or identity based policies?
Both. It also supports request signing and IP based policies.
165
How can you allow access to opensearch through a VPC to external users?
Using Cognito / SAML, Reverse Proxy, SSH, VPC Direct Connect, or a VPN
166
What type of storage does an OpenSearch data node use by default?
Hot storage. This is an instance store or EBS volume.
167
What is UltraWarm storage in OpenSearch?
It uses S3 plus caching. Best for indices with few writes (log or immutable data). Slower performance, and it requires a dedicated master node.
168
What is cold storage in OpenSearch?
It uses S3. Best for periodic research or forensic analysis on older data. It requires a dedicated master node, and UltraWarm must also be enabled.
169
Can data in OpenSearch be migrated between storage types?
Yes
170
What is Index State Management in OpenSearch?
It automates index management policies: taking snapshots, deleting indices after a period of time, moving indices from hot to cold storage over time, and reducing replica counts.
171
How often are index state management policies run in OpenSearch?
Every 30 to 48 minutes.
172
What are index Rollups in OpenSearch?
They roll up old data into summarized indices. The new index may have fewer fields. Good for saving on storage.
173
What are index transforms in OpenSearch?
Like rollups, but the purpose is to create a different view so you can analyze the data differently.
174
Can you replicate data across clusters in OpenSearch?
Yes.
175
What is a follower index in OpenSearch?
It pulls from the leader index to replicate data.
176
How do you copy indices from cluster to cluster on demand in OpenSearch?
Remote Reindex
177
What is the best practice for OpenSearch master nodes?
Use three dedicated master nodes.
178
What should you do when you see JVM memory pressure errors in OpenSearch?
Delete old or unused indices.
179
What is a big pro for OpenSearch Serverless?
On-Demand autoscaling
180
What are the two collection types in OpenSearch Serverless?
search or time series
181
What are the sources of QuickSight?
Redshift, Aurora, RDS, Athena, OpenSearch, IoT Analytics, your own databases, and raw files such as CSV, Excel, and log files.
182
Can QuickSight perform ETL?
Very light ETL.
183
What is QuickSight SPICE?
Your datasets get imported into SPICE, an in-memory calculation engine. Each user gets 10 GB of SPICE capacity. It accelerates large queries.
184
What happens when importing data from Athena to Spice takes more than 30 minutes?
It times out.
185
What is a good use case for QuickSight?
Ad hoc exploration and visualization; dashboards and KPIs.
186
Does QuickSight support MFA?
Yes
187
Does QuickSight support row and column level security?
Yes. Row-level security is available in Standard, but column-level security is only available in the Enterprise edition.
188
What data security permissions need to be added to QuickSight?
You need to make sure QuickSight can access your data. You need to create IAM policies that restrict what data in S3 users can see.
189
Can QuickSight access Redshift data in other regions?
No. QuickSight can only access Redshift data in the same region.
190
How do you access Redshift data in another region using QuickSight Standard?
Add an inbound security group rule allowing access to Redshift from the QuickSight IP range.
191
If you want to keep QuickSight in a private VPC, what version do you need?
Enterprise Edition
192
How do you access Redshift data in another region using QuickSight Enterprise?
Use private subnets and peering connections, with route tables tying it together. Transit Gateway can be used for cross-account access.
193
If you want to use an Active Directory connector for QuickSight, what version do you need?
Enterprise edition
194
Can you use customer managed keys in QuickSight?
Not in Standard; the Enterprise edition allows you to use KMS keys.
195
What is QuickSight Q?
An NLP interface on top of QuickSight
196
Can SPICE capacity be added for a user?
Yes. It is billed per additional GB of SPICE.
197
Is encryption at-rest included in the standard version of QuickSight?
No
198
Can you embed dashboards into 3rd party apps using QuickSight?
Yes, using the JavaScript SDK.
199
What needs to be done for embedded dashboards to work on a 3rd party site using QuickSight?
Domain Whitelisting
200
What ML capabilities does QuickSight have?
Anomaly detection; forecasting (seasonality and trends over time, imputing missing values); autonarratives (a story of your data in paragraph format); and suggested insights (helping you decide which feature is right for your dataset).