Section 22: Data & Analytics Flashcards by Marcello Amente

What is AWS Athena?

AWS Athena is an interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL.
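As a rough illustration of running SQL against S3 data, the sketch below submits a query through the Athena API. It assumes the caller passes in an object that behaves like `boto3.client("athena")`; the helper name `start_athena_query` and any database, table, or bucket names used with it are hypothetical.

```python
def start_athena_query(client, sql, database, output_location):
    """Submit a SQL string to Athena and return the query execution ID.

    `client` is assumed to behave like boto3.client("athena").
    `output_location` is the S3 URI where Athena writes the result files.
    """
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

A call might look like `start_athena_query(boto3.client("athena"), "SELECT * FROM logs LIMIT 10", "example_db", "s3://example-results-bucket/athena/")`, where every name is a placeholder.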

True or False: AWS Athena is a serverless service.

True.

Fill in the blank: AWS Athena charges you based on the amount of ______ processed by your queries.

data
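To make the pay-per-data-scanned model concrete, here is a minimal cost estimate. The price of 5 USD per TB and the 10 MB per-query minimum (with rounding up to the next megabyte) are assumptions for illustration; check the current Athena pricing page for your Region.

```python
import math

PRICE_PER_TB_USD = 5.00            # assumption: example price, varies by Region
MIN_BILLED_BYTES = 10 * 1024**2    # assumption: 10 MB minimum per query
BYTES_PER_MB = 1024**2
BYTES_PER_TB = 1024**4

def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of one query from the bytes it scanned."""
    # Round up to the next whole megabyte, then apply the per-query minimum.
    rounded = math.ceil(bytes_scanned / BYTES_PER_MB) * BYTES_PER_MB
    billed = max(rounded, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, scanning half as much data costs half as much, which is why the later cards on partitioning and columnar formats matter for cost.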

What types of data formats does AWS Athena support?

Athena supports formats such as CSV, JSON, ORC, Parquet, and Avro.

How does AWS Athena handle security?

Athena uses AWS Identity and Access Management (IAM) for access control and integrates with AWS Key Management Service (KMS) for data encryption.

What is the maximum query result size in AWS Athena?

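Athena writes query results to an S3 location you specify, so result size is not capped at a small fixed value. Consult the current Athena service quotas for the limits that do apply (for example, query string length and query timeout).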

Which AWS service is often used in conjunction with Athena for data cataloging?

AWS Glue.

True or False: You need to provision servers to use AWS Athena.

False.

What is the primary use case for AWS Athena?

Athena is primarily used for querying large datasets stored in Amazon S3 without the need for data loading.

What SQL dialect does AWS Athena use?

Athena uses a SQL dialect based on Presto; engine version 3 is based on Trino, the successor to Presto.

Can AWS Athena query data stored in formats like Parquet and ORC?

Yes, Athena can query data stored in Parquet and ORC formats.

What are the two main components of AWS Athena?

The two main components are the query engine and the data catalog.

Fill in the blank: AWS Athena can be accessed via the ______ console, the AWS CLI, and the AWS SDKs.

AWS Management

How does AWS Athena integrate with Amazon QuickSight?

Athena can be used as a data source for Amazon QuickSight to visualize data.

What is the role of AWS Glue Data Catalog in relation to AWS Athena?

AWS Glue Data Catalog serves as a central repository to store metadata for the data queried in Athena.

Can you use AWS Athena to join data from multiple S3 buckets?

Yes, you can join data from multiple S3 buckets in AWS Athena.

True or False: AWS Athena supports partitioned tables.

True.

What is the benefit of using partitioning in AWS Athena?

Partitioning improves query performance by reducing the amount of data scanned.
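The effect of partitioning can be sketched with a toy model of partition pruning. The bucket name, the `year=`/`month=` layout, and the byte counts below are hypothetical; the helpers only imitate the idea that a filter on partition columns lets the engine skip whole S3 prefixes.

```python
# Hypothetical Hive-style partitions: S3 prefix -> bytes stored under it.
PARTITIONS = {
    "s3://example-bucket/logs/year=2023/month=11/": 400,
    "s3://example-bucket/logs/year=2023/month=12/": 500,
    "s3://example-bucket/logs/year=2024/month=01/": 600,
}

def partition_values(prefix):
    """Parse the key=value path segments of a prefix into a dict."""
    return dict(seg.split("=", 1) for seg in prefix.split("/") if "=" in seg)

def bytes_scanned(partitions, **filters):
    """Sum the bytes of the partitions whose values match every filter."""
    total = 0
    for prefix, size in partitions.items():
        values = partition_values(prefix)
        if all(values.get(k) == v for k, v in filters.items()):
            total += size
    return total
```

With no filter, `bytes_scanned(PARTITIONS)` covers all three prefixes; adding `year="2024"` restricts the scan to a single prefix.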

How long does AWS Athena retain query history and query results?

Athena retains query history for 45 days. The result files themselves stay in the S3 results location until you delete them or an S3 lifecycle rule expires them.

Which AWS service can be used to schedule queries in AWS Athena?

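A scheduled Amazon EventBridge rule can invoke an AWS Lambda function that starts the Athena query; Lambda runs the query, and EventBridge provides the schedule.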

Fill in the blank: AWS Athena can be used to analyze data in _____ time.

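real (in practice, near real time: queries run interactively against whatever data has already landed in S3; Athena is not a streaming engine)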

What type of queries can be run on AWS Athena?

You can run ad-hoc queries, complex queries, and analytical queries on AWS Athena.

True or False: AWS Athena allows you to create views.

True.

In Athena, what type of data format can save costs and improve performance?

Columnar formats (for example, Parquet or ORC), because a query reads only the columns it needs.
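A toy comparison shows why a columnar layout cuts the data scanned. The rows, field names, and the size measure (length of each value's string form) are all made up for illustration and do not reflect how Parquet or ORC actually encode data.

```python
# Hypothetical rows with one small field of interest and one bulky field.
ROWS = [
    {"user_id": 1, "country": "IT", "payload": "x" * 100},
    {"user_id": 2, "country": "US", "payload": "y" * 100},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def size_of(values):
    """Approximate storage size as the total length of the string forms."""
    return sum(len(str(v)) for v in values)

columns = to_columns(ROWS)
# Row-oriented: answering "SELECT country" still reads every field of every row.
row_scan = sum(size_of(row.values()) for row in ROWS)
# Column-oriented: only the requested column is read.
column_scan = size_of(columns["country"])
```

Here `column_scan` is a small fraction of `row_scan`, and since Athena bills by data scanned, the same ratio would apply to cost.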

What file sizes should you use in Athena to improve performance?

128 MB or larger; many small files add per-file overhead.
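One way to act on this is to compact small files before querying them. The planner below is a minimal sketch that greedily groups file sizes into batches of at least 128 MB; the function name and the greedy strategy are illustrative assumptions, not an Athena or Glue feature.

```python
TARGET_BYTES = 128 * 1024**2  # the 128 MB guideline from the card above

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group file sizes into batches of at least `target` bytes.

    The final batch may be smaller if the leftover files do not add up.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= target:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be rewritten as a single larger file by whatever job does the compaction.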

Which Athena feature runs SQL queries across other data sources?

Federated Query (it uses data source connectors that run on AWS Lambda).

What is AWS Redshift?

AWS Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

True or False: AWS Redshift is designed for real-time data analytics.

False: AWS Redshift is optimized for complex queries and large-scale data analysis rather than real-time analytics.

What type of storage does AWS Redshift use?

AWS Redshift uses columnar storage, which organizes data by columns rather than rows.

Fill in the blank: AWS Redshift is based on __________ technology.

PostgreSQL

Which of the following is a key feature of AWS Redshift? A) Real-time streaming B) Columnar storage C) Unmanaged services D) On-premises deployment

B) Columnar storage

What is the maximum size of a single Redshift cluster?

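The ceiling depends on the node type and the number of nodes; Redshift is designed to scale to petabytes of data. Check the current Amazon Redshift quotas for exact figures.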

What is the purpose of the Redshift Spectrum feature?

Redshift Spectrum allows you to run queries against data stored in S3 without loading it into Redshift.

True or False: AWS Redshift can automatically scale based on workload.

Partly true: a provisioned cluster is resized manually, but Concurrency Scaling can add transient capacity automatically and Redshift Serverless scales capacity with the workload.

What is the primary benefit of using AWS Redshift for data warehousing?

The primary benefit is its ability to handle large datasets and perform complex queries quickly and efficiently.

What is a Redshift cluster?

A Redshift cluster is a set of nodes that work together to store and analyze data.

Fill in the blank: The main component of a Redshift cluster that manages the database is called the __________.

leader node

What is the purpose of the distribution style in Redshift?

Distribution style determines how data is distributed across the nodes in a cluster to optimize query performance.
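The difference between distribution styles can be sketched with two toy placement functions. Redshift offers AUTO, EVEN, KEY, and ALL styles; the code below only imitates EVEN and KEY, and the use of CRC32 as the hash is an assumption for illustration, not Redshift's actual hashing.

```python
import zlib

def distribute_even(rows, num_slices):
    """EVEN-like: round-robin rows across slices regardless of content."""
    return [i % num_slices for i, _ in enumerate(rows)]

def distribute_key(rows, key, num_slices):
    """KEY-like: rows with the same key value always land on the same slice."""
    return [zlib.crc32(str(row[key]).encode()) % num_slices for row in rows]
```

Because KEY-style placement keeps matching values together, two tables distributed on the same join column can be joined without moving rows between nodes, whereas EVEN-style placement favors balanced storage.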

Which SQL type does AWS Redshift support?

AWS Redshift supports a variant of SQL called Redshift SQL, which is based on PostgreSQL.

What is the function of the COPY command in Redshift?

The COPY command is used to load data from S3, DynamoDB, or other data sources into Redshift tables.

True or False: AWS Redshift provides built-in machine learning capabilities.

True: AWS Redshift has integrated machine learning features that allow users to create and run models directly within the data warehouse.

Is Redshift OLAP or OLTP?

OLAP (online analytical processing).

Athena or Redshift for faster queries?

Redshift

How do Redshift snapshots support disaster recovery?

Snapshots can be copied to another Region, either manually or automatically through cross-Region snapshot copy.

What are three ways to load data into Redshift?

Kinesis Data Firehose, the COPY command from S3, and an EC2 instance writing over JDBC.

OpenSearch allows you to query these types of matches

Partial matches

OpenSearch security

Cognito, IAM, KMS, TLS

What does EMR stand for?

Elastic MapReduce

What is the main function of Amazon EMR?

It helps create Hadoop clusters to analyze and process vast amounts of data.

What type of AWS service is Amazon EMR?

A managed big data platform.

What AWS resource does an EMR cluster use to scale?

EC2 instances (can be hundreds of them).

Name some frameworks bundled with Amazon EMR.

Apache Spark, HBase, Presto, Flink

What does Amazon EMR handle automatically during cluster setup?

Provisioning and configuration.

What feature allows EMR to adjust resources based on workload?

Auto-scaling.

Can Amazon EMR use Spot Instances?

Yes, it's integrated with Spot Instances for cost savings.

What are common use cases for Amazon EMR?

Data processing, machine learning, web indexing, and big data workloads.

What is the role of the Master Node in EMR?

Manages the cluster, coordinates tasks, and monitors health.

What does the Core Node do in EMR?

Runs tasks and stores data (long-running).

What is the purpose of the Task Node in EMR?

Only runs tasks and is often configured as a Spot Instance.

What are the three purchasing options for EMR instances?

On-demand, Reserved, and Spot Instances.

What are the characteristics of On-Demand instances in EMR?

Reliable, predictable, and won't be terminated.

What are the benefits of Reserved Instances in EMR?

Cost savings and automatically used by EMR if available (minimum 1-year commitment).

What is the trade-off when using Spot Instances in EMR?

They are cheaper but can be terminated and are less reliable.

What are the two types of EMR clusters based on duration?

Long-running clusters and transient (temporary) clusters.

Serverless service to use to create business intelligence focused interactive dashboards

QuickSight

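Related to QuickSight, what is SPICE?

SPICE (Super-fast, Parallel, In-memory Calculation Engine) is QuickSight's in-memory computation engine; it is used when data is imported into QuickSight.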

With Glue, what prevents re-processing old data?

Glue Job Bookmarks
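The idea behind bookmarks can be sketched as "remember what was processed, skip it next time". The function below is a conceptual analogy only: the name `run_incremental_job` and the in-memory set standing in for persisted bookmark state are assumptions, and Glue's real bookmark mechanism is not shown here.

```python
def run_incremental_job(available_keys, bookmark):
    """Process only keys not yet recorded in `bookmark`, then update it.

    `bookmark` is a set standing in for the persisted bookmark state.
    Returns the sorted list of keys handled in this run.
    """
    new_keys = sorted(k for k in available_keys if k not in bookmark)
    bookmark.update(new_keys)
    return new_keys
```

Running it twice over the same input processes everything the first time and nothing the second time, which is the behavior the card describes.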

Which Glue feature cleans and normalizes data using pre-built transformations?

Glue DataBrew

(68 cards)