Hadoop Ecosystem: Fundamentals of Distributed Systems Flashcards

1
Q

What is AWS Athena?

A

AWS Athena is an interactive query service from Amazon Web Services (AWS) that lets users analyze data stored in Amazon S3 using standard SQL. It is serverless, so there is no infrastructure to manage, and data is queried in place in S3 rather than loaded into a separate database or data warehouse. Athena supports various file formats, including CSV, JSON, Parquet, and ORC, making it suitable for analyzing structured, semi-structured, and unstructured data in S3.
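
A minimal sketch of running an Athena query from Python with boto3; the database name, table name, and S3 result location below are hypothetical placeholders:

```python
import time

import boto3

# Athena is serverless: a client submits SQL and Athena writes the
# results to an S3 location of your choosing.
athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page",
    QueryExecutionContext={"Database": "analytics_db"},  # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = query["QueryExecutionId"]

# Poll until the query reaches a terminal state, then print the rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```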

2
Q

What is AWS Glue?

A

AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS) for preparing and loading data for analytics. It automates the process of discovering, cataloging, cleaning, and transforming data, making it easier to prepare data for analysis. Glue offers both visual and code-based interfaces for building ETL jobs, and it integrates with various AWS services such as Amazon S3, Amazon RDS, and Amazon Redshift.
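
A minimal Glue ETL script sketch in PySpark, following Glue's standard job structure and assuming a crawler has already cataloged the source table; all database, table, and bucket names are hypothetical:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue boilerplate: wrap a SparkContext in a GlueContext.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table the crawler discovered (hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Transform: rename and cast columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amt", "double", "amount", "double"),
    ],
)

# Load: write the cleaned data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean/orders/"},
    format="parquet",
)
job.commit()
```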

3
Q

What is AWS Glue?

A

AWS Glue is a fully managed extract, transform, and load (ETL) service provided by Amazon Web Services (AWS) for preparing and loading data for analytics.

4
Q

What does AWS Glue automate?

A

AWS Glue automates the process of discovering, cataloging, cleaning, and transforming data, making it easier to prepare data for analysis.

5
Q

Which underlying engine does AWS Glue use for executing ETL jobs?

A

AWS Glue uses Apache Spark as its underlying engine for executing ETL jobs.
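
Because Spark sits underneath, a Glue script can drop from Glue's DynamicFrame API down to the plain Spark DataFrame/SQL API at any point. A small sketch (catalog names hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session  # the underlying SparkSession

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"  # hypothetical
)

# Convert to a Spark DataFrame and use ordinary Spark SQL.
df = dyf.toDF()
df.createOrReplaceTempView("orders")
big_orders = spark.sql("SELECT * FROM orders WHERE amount > 100")
big_orders.show()
```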

6
Q

What interfaces does AWS Glue offer for building ETL jobs?

A

AWS Glue offers both visual and code-based interfaces for building ETL jobs, providing flexibility for users with different preferences.

7
Q

With which AWS services does AWS Glue integrate?

A

AWS Glue integrates with various AWS services such as Amazon S3, Amazon RDS, and Amazon Redshift, allowing seamless data processing and integration across different AWS data sources.
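
For example, a job can read an S3-backed catalog table and load it into Redshift through a Glue connection. A sketch assuming a Glue connection named "redshift-conn" has already been configured (all names hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read from an S3-backed Glue catalog table (hypothetical names)...
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="clean_orders"
)

# ...and load into Redshift; Glue stages the data in S3 and issues a COPY.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=orders,
    catalog_connection="redshift-conn",  # hypothetical Glue connection
    connection_options={"dbtable": "public.orders", "database": "dw"},
    redshift_tmp_dir="s3://my-bucket/tmp/",
)
```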

8
Q

What are the maturity stages of AWS Glue or a similar technology?

A

Initial Stage: In this stage, organizations start exploring AWS Glue or similar ETL tools, experimenting with basic functionalities and use cases.
Adoption Stage: Organizations begin to adopt AWS Glue for specific projects or departments, integrating it into their data workflows and processes.
Expansion Stage: AWS Glue usage expands across multiple teams or departments within the organization, with increased adoption for various data integration and transformation tasks.
Optimization Stage: Organizations focus on optimizing their usage of AWS Glue, fine-tuning ETL processes, improving performance, and enhancing data governance and security.
Maturity Stage: At this stage, AWS Glue is fully integrated into the organization’s data architecture, serving as a core component for data processing, integration, and analytics across the enterprise.

9
Q

What characterizes the initial stage of AWS Glue or similar ETL technology adoption?

A

Experimentation with basic functionalities.
Limited use cases and exploration of capabilities.
Minimal integration into existing data workflows.

10
Q

What happens during the adoption stage of AWS Glue or similar ETL technology?

A

Organizations begin integrating AWS Glue into specific projects or departments.
Initial use cases are identified and implemented.
Training and education on AWS Glue usage are provided to relevant teams.

11
Q

How does AWS Glue usage expand during the expansion stage?

A

Adoption of AWS Glue extends to multiple teams or departments.
Usage expands beyond initial use cases to encompass various data integration and transformation tasks.
Integration with other AWS services and data sources increases.

12
Q

What is the focus of the optimization stage in AWS Glue maturity?

A

Optimization of ETL processes for improved performance and efficiency.
Implementation of advanced features and best practices.
Emphasis on data governance, security, and compliance requirements.

13
Q

What characterizes the maturity stage of AWS Glue or similar ETL technology adoption?

A

Full integration into the organization’s data architecture.
AWS Glue serves as a core component for data processing, integration, and analytics.
Continuous improvement and innovation in data management practices leveraging AWS Glue.

14
Q

What is Hadoop?

A

Hadoop is an open-source framework developed by the Apache Software Foundation for distributed storage and processing of large datasets across clusters of commodity hardware.

15
Q

What are the core components of the Hadoop ecosystem?

A

Hadoop Distributed File System (HDFS) for distributed storage.
MapReduce for distributed processing (word-count sketch below).
YARN (Yet Another Resource Negotiator) for resource management and job scheduling.
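
The word-count sketch referenced above, written as Hadoop Streaming scripts in Python (Streaming lets any program that reads stdin and writes stdout act as a mapper or reducer; the file names are hypothetical):

```python
# mapper.py: emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: Hadoop sorts mapper output by key, so all counts for a
# word arrive adjacent; sum them and emit one total per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

When submitted through the hadoop-streaming JAR, the framework handles input splitting, the shuffle and sort between phases, and retrying failed tasks across the cluster.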

16
Q

What are some common use cases of Hadoop?

A

Large-scale batch processing of structured and unstructured data.
Log processing and analysis.
Data warehousing and ETL (Extract, Transform, Load) processes.
Machine learning and predictive analytics.

17
Q

How does Hadoop achieve scalability?

A

Hadoop distributes data across multiple nodes in a cluster, allowing for parallel processing.
It can scale horizontally by adding more nodes to the cluster as data volume grows.

18
Q

What other technologies are part of the Hadoop ecosystem?

A

Apache Hive for data warehousing and SQL-like querying.
Apache Pig for data flow scripting.
Apache HBase for real-time, NoSQL database functionality.
Apache Spark for in-memory processing and analytics (example below).
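
A tiny PySpark example of Spark's in-memory processing, runnable in local mode with no cluster assumed:

```python
from pyspark.sql import SparkSession

# Local-mode session; the same code runs distributed on a cluster.
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)], ["user", "clicks"]
)

# cache() keeps the dataset in memory across the two actions below,
# the core of Spark's speed advantage over disk-based MapReduce.
df.cache()
df.groupBy("user").sum("clicks").show()
print(df.count())
spark.stop()
```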

19
Q

What is HDFS?

A

HDFS, or Hadoop Distributed File System, is a distributed file system designed to store large datasets across clusters of commodity hardware in a fault-tolerant manner.

20
Q

Describe the architecture of HDFS.

A

HDFS follows a master-slave architecture.
The master node is the NameNode, which manages file-system metadata and coordinates data storage and retrieval.
The slave nodes are DataNodes, which store the actual data blocks and serve read and write operations (see the client sketch below).
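
In practice, a client contacts the NameNode for metadata and then streams block data to and from DataNodes directly. A sketch using pyarrow's HDFS binding (host, port, and path are placeholders; libhdfs and a Hadoop client installation must be available):

```python
from pyarrow import fs

# Connect via the NameNode; block I/O then goes to DataNodes directly.
# replication=3 matches the HDFS default replication factor.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020,
                           replication=3)

# Write a small file, then read it back.
with hdfs.open_output_stream("/data/demo.txt") as f:
    f.write(b"hello hdfs\n")

with hdfs.open_input_stream("/data/demo.txt") as f:
    print(f.read())
```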

21
Q

How does HDFS ensure fault tolerance?

A

HDFS replicates data across multiple DataNodes to ensure fault tolerance.
By default, each data block is replicated three times across different DataNodes in the cluster.

22
Q

How does HDFS organize data?

A

Data in HDFS is stored in large files, broken down into smaller blocks (typically 128 MB or 256 MB in size).
Blocks are distributed across DataNodes in the cluster.
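
A quick back-of-the-envelope sketch, assuming the default 128 MB block size and the default replication factor of three:

```python
import math

BLOCK = 128 * 1024**2   # default HDFS block size: 128 MiB
REPLICAS = 3            # default HDFS replication factor

def hdfs_footprint(file_bytes: int) -> tuple[int, int]:
    """Return (logical blocks, total stored block replicas) for a file."""
    blocks = math.ceil(file_bytes / BLOCK)
    return blocks, blocks * REPLICAS

# A 1 GiB file splits into 8 blocks, stored as 24 replicas cluster-wide.
print(hdfs_footprint(1024**3))  # (8, 24)
```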

23
Q

What are the typical access patterns in HDFS?

A

HDFS is optimized for sequential read and write operations.
Random reads and writes are less efficient due to the distributed nature of data storage.