Data and Analytics Flashcards

1
Q

What is AWS Athena?

A

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Analyze data or build applications from an Amazon Simple Storage Service (S3) data lake and 25+ data sources, including on-premises data sources or other cloud systems using SQL or Python.
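As a concrete sketch of how a query is submitted, the helper below builds the request body for Athena's StartQueryExecution API. The database and S3 output bucket names are hypothetical; with AWS credentials configured, the dict could be passed straight to boto3.

```python
def build_athena_query_params(sql, database, output_s3):
    """Build the request body for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_athena_query_params(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",                    # hypothetical Glue database
    output_s3="s3://my-athena-results/",   # hypothetical results bucket
)
# With credentials configured you could then run:
# boto3.client("athena").start_query_execution(**params)
```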

2
Q

How to improve Athena’s performance?

A

Use columnar data for cost savings and better performance. Athena charges based on the amount of data scanned. Parquet and ORC are popular columnar formats; other formats can be converted into them using AWS Glue.

Use compressed data for smaller retrievals (e.g., gzip, LZ4).
Partition the data set in S3 to allow easy querying on virtual columns.
Use larger files for better performance.
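The partitioning tip above relies on Hive-style key=value prefixes in S3, which Athena exposes as virtual columns. A minimal sketch (bucket and table names are hypothetical):

```python
def partitioned_key(prefix, table, partitions, filename):
    """Build a Hive-style partitioned S3 key. Queries that filter on
    these virtual columns (e.g. WHERE year='2024') let Athena skip
    every object outside the matching prefix."""
    parts = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{prefix}/{table}/{parts}/{filename}"

key = partitioned_key("s3://my-data-lake", "events",
                      {"year": "2024", "month": "06"}, "part-0000.parquet")
# → s3://my-data-lake/events/year=2024/month=06/part-0000.parquet
```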

3
Q

What is Athena federated query?

A

If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.
Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena.
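In a federated query, the connector-backed source appears as an extra catalog in the SQL. The sketch below builds such a query; the catalog name `mysql_catalog` stands for whatever name you registered the Lambda connector under, and the table names are hypothetical:

```python
def federated_join(s3_table, connector_catalog, connector_table, key):
    """SQL joining an S3-backed table with a table exposed through a
    Lambda-based data source connector, referenced as catalog.table."""
    return (
        f"SELECT a.*, b.* FROM {s3_table} a "
        f"JOIN {connector_catalog}.{connector_table} b ON a.{key} = b.{key}"
    )

sql = federated_join("weblogs.access_logs", "mysql_catalog",
                     "shop.orders", "user_id")
```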

4
Q

What is Redshift?

A
  • Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This allows you to use your data to gain new insights for your business and customers.
  • Redshift is based on PostgreSQL, but it is not used for OLTP. It is used for OLAP (online analytical processing).
  • Data is stored in columnar format.
  • You pay based on the provisioned instances.
  • You can query the data using SQL statements.
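The columnar-format point is what makes Redshift suited to OLAP. A toy pure-Python illustration (not Redshift's actual storage engine): an aggregate over one column touches only that column's contiguous values, not every full row.

```python
# Row-oriented layout: one dict per row.
row_store = [
    {"user": "a", "region": "eu", "spend": 10},
    {"user": "b", "region": "us", "spend": 25},
    {"user": "c", "region": "eu", "spend": 5},
]

# Columnar layout: one contiguous list per column.
col_store = {k: [r[k] for r in row_store] for k in row_store[0]}

# An OLAP-style aggregate reads just the 'spend' column.
total_spend = sum(col_store["spend"])
# → 40
```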
5
Q

What is AWS Redshift cluster?

A

An Amazon Redshift cluster consists of nodes. Each cluster has a leader node, and one or more compute nodes. The leader node receives queries from client applications, parses the queries, and develops query execution plans. The leader node then coordinates the parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. It then finally returns the results back to the client applications.
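The leader/compute split above can be sketched as a toy parallel aggregation (an illustration of the coordination pattern, not Redshift internals): the "leader" slices the data, "compute nodes" produce partial results in parallel, and the leader aggregates them.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_node(shard):
    """Each compute node executes its slice of the plan (a partial sum)."""
    return sum(shard)

def leader_node(data, n_nodes=2):
    """The leader distributes slices to compute nodes, runs them in
    parallel, then aggregates the intermediate results."""
    shards = [data[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(compute_node, shards))
    return sum(partials)

result = leader_node([1, 2, 3, 4, 5, 6])
# → 21
```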

6
Q

How is disaster recovery done in Redshift?

A

There is no multi-AZ mode in Redshift. Snapshots of clusters are taken point-in-time and stored on S3. The snapshots are incremental - only what has changed since the last snapshot is saved.
- A snapshot can be restored into a new cluster.
- Snapshots can be taken automatically or manually. Automatic snapshots can be taken every 8 hours, and you can set the retention period of the automated copies. Manually taken snapshots are retained until you delete them.
- You can configure Redshift to copy snapshots of a cluster to another AWS Region automatically, which is helpful for disaster recovery.
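The cross-Region copy is enabled through Redshift's EnableSnapshotCopy API. A hedged sketch of the request (cluster name is hypothetical; region and retention are illustrative):

```python
def cross_region_copy_params(cluster_id, dest_region, retention_days):
    """Request body for Redshift's EnableSnapshotCopy API, which ships
    automated snapshots to another Region for disaster recovery."""
    return {
        "ClusterIdentifier": cluster_id,
        "DestinationRegion": dest_region,
        "RetentionPeriod": retention_days,
    }

params = cross_region_copy_params("analytics-cluster", "us-west-2", 7)
# boto3.client("redshift").enable_snapshot_copy(**params)
```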

7
Q

What is Redshift Spectrum?

A

With Redshift Spectrum, an analyst can perform SQL queries on data stored in Amazon S3 buckets. This can save time and money because it eliminates the need to move data from a storage service to a database, and instead directly queries data inside an S3 bucket. Redshift Spectrum also expands the scope of a given query because it can scale out to thousands of Redshift Spectrum nodes to execute the query.
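Spectrum queries typically run against an external table mapped onto an S3 prefix. A hedged sketch of the DDL (schema, column, and bucket names are hypothetical):

```python
# Example DDL exposing Parquet files in S3 as an external table that
# Redshift Spectrum can query alongside local Redshift tables.
ddl = """
CREATE EXTERNAL TABLE spectrum_schema.sales (
    sale_id   BIGINT,
    amount    DECIMAL(10,2),
    sale_date DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/';
""".strip()
```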

8
Q

What is Redshift enhanced VPC routing?

A

When you use Amazon Redshift enhanced VPC routing, Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your virtual private cloud (VPC) based on the Amazon VPC service.
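Enhanced VPC routing is a single flag on the cluster, set through Redshift's ModifyCluster API. A sketch of the request (cluster name is hypothetical):

```python
def enable_enhanced_vpc_routing(cluster_id):
    """Request body for Redshift's ModifyCluster API that turns on
    enhanced VPC routing, so COPY/UNLOAD traffic stays inside the VPC."""
    return {"ClusterIdentifier": cluster_id, "EnhancedVpcRouting": True}

params = enable_enhanced_vpc_routing("analytics-cluster")
# boto3.client("redshift").modify_cluster(**params)
```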

9
Q

What is Amazon OpenSearch?

A

OpenSearch is a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. OpenSearch provides a highly scalable system for providing fast access and response to large volumes of data with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data.

With OpenSearch, you can search any field, including partial matches. It is common to use OpenSearch as a complement to another database.

  • OpenSearch requires a cluster of instances - it is not serverless.
  • It does not use SQL - it has its own query language.
  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs.
  • Security through Cognito and IAM, KMS encryption, and TLS.
  • Comes with OpenSearch Dashboards (visualization).
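OpenSearch's own query language is a JSON query DSL. A minimal sketch of a full-text match request body (field name and index are hypothetical):

```python
def partial_match_query(field, text):
    """OpenSearch query DSL for a full-text 'match' on one field; the
    text is analyzed, so individual tokens can match partially."""
    return {"query": {"match": {field: text}}}

body = partial_match_query("message", "timeout error")
# Sent as the JSON body of: POST /<index>/_search
```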
10
Q

What is AWS EMR?

A

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.

The following are node types:
Master node: A node that manages the cluster by running software components to coordinate data distribution and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the cluster’s health. Every cluster has a master node, and it’s possible to create a single-node cluster with only the master node.

Core node: A node with software components that run tasks and store data in your cluster’s Hadoop Distributed File System (HDFS). Multi-node clusters have at least one core node.

Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
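The three node types map directly onto the instance groups passed to EMR's RunJobFlow API. A hedged sketch (instance types and counts are illustrative, not recommendations):

```python
def emr_instance_groups(master_type="m5.xlarge", core_count=2, task_count=2):
    """Instance-group layout for EMR's RunJobFlow API, mirroring the
    master/core/task node types described above."""
    return [
        {"InstanceRole": "MASTER", "InstanceType": master_type, "InstanceCount": 1},
        {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        {"InstanceRole": "TASK",   "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]

groups = emr_instance_groups()
# Passed as Instances={"InstanceGroups": groups, ...} to emr.run_job_flow(...)
```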

11
Q

what are glue job bookmarks?

A

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data.
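Bookmarks are switched on per job via the `--job-bookmark-option` argument in the job's default arguments. A minimal sketch:

```python
def glue_job_args(enable_bookmark=True):
    """DefaultArguments for a Glue job; --job-bookmark-option controls
    whether data processed in earlier runs is skipped on reruns."""
    option = "job-bookmark-enable" if enable_bookmark else "job-bookmark-disable"
    return {"--job-bookmark-option": option}

args = glue_job_args()
# Passed as DefaultArguments to glue.create_job(...)
```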

12
Q

What is Amazon MSK?

A

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. Amazon MSK provides control-plane operations, such as those for creating, updating, and deleting clusters.
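One such control-plane operation is CreateCluster. A hedged sketch of its request body (cluster name and subnet IDs are hypothetical; the Kafka version, broker count, and instance type are illustrative):

```python
def msk_cluster_params(name, subnets):
    """Control-plane request body for MSK's CreateCluster API."""
    return {
        "ClusterName": name,
        "KafkaVersion": "3.6.0",
        "NumberOfBrokerNodes": 3,
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m5.large",
            "ClientSubnets": subnets,
        },
    }

params = msk_cluster_params("events-cluster",
                            ["subnet-aaa", "subnet-bbb", "subnet-ccc"])
# boto3.client("kafka").create_cluster(**params)
```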
