AWS Services Flashcards

1
Q

Amazon Kinesis Agent

A

AWS provides the Amazon Kinesis Agent, a tool that simplifies the process of consuming data from a file and streaming it to either Kinesis Data Streams or Kinesis Data Firehose. The agent, available on GitHub as a Java application, can be configured to monitor specific files and buffer the data for a customizable duration before writing it to Kinesis. It handles retries, file rotation, and checkpointing. A typical use case involves monitoring Apache web server log files, converting the records to JSON format, and streaming them to Kinesis at regular intervals for real-time analysis using Kinesis Data Analytics.

The Amazon Kinesis Agent is a good fit when the data to be streamed is produced by a separate process, such as web server log files. For custom applications that emit streaming events directly, consider alternatives such as the Amazon Kinesis Producer Library (KPL) or the AWS SDK, which integrate streaming into the application itself.
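
As a sketch of the setup described above, a minimal agent configuration (typically /etc/aws-kinesis/agent.json) that tails Apache access logs, converts each record to JSON, and sends it to a hypothetical Firehose delivery stream might look like this; the file pattern and stream name are placeholders:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/httpd/access_log*",
      "deliveryStream": "web-logs",
      "dataProcessingOptions": [
        {
          "optionName": "LOGTOJSON",
          "logFormat": "COMBINEDAPACHELOG"
        }
      ]
    }
  ]
}
```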

2
Q

Amazon Kinesis Firehose

A

Amazon Kinesis Firehose simplifies the ingestion of near real-time data from streaming sources and delivers it to targets such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), and third-party services such as Splunk, Datadog, and New Relic. For example, website clickstream data from Apache web logs can be ingested by installing the Kinesis Agent on the web server and configuring it to monitor the log files. The agent periodically writes records to the Kinesis Firehose endpoint, which buffers the data and writes it out to the specified target once a time interval or record size threshold is reached. Kinesis Firehose can also convert the data to Parquet or ORC format, or apply custom transformations using an AWS Lambda function.

Amazon Kinesis Firehose is suitable for scenarios where streaming data needs to be received, buffered for a specific duration, and then written to one of the supported targets. If you need low-latency processing of incoming data, a custom processing application, or delivery to a service Firehose does not support, consider alternatives such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (MSK).
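
For producers that write directly rather than through the agent, a minimal sketch using the AWS SDK for Python (boto3); the delivery stream name is hypothetical:

```python
import boto3

firehose = boto3.client("firehose")

# Send one clickstream record; Firehose buffers it and delivers to the
# configured target (e.g., S3) once the time or size threshold is reached.
firehose.put_record(
    DeliveryStreamName="web-clickstream",  # hypothetical stream name
    Record={"Data": b'{"path": "/checkout", "status": 200}\n'},
)
```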

3
Q

Amazon Kinesis Data Streams

A

Kinesis Firehose and Kinesis Data Streams offer different capabilities for data ingestion and processing. While Firehose buffers data before writing it to supported targets, Data Streams provides low-latency access to streaming applications, making data available within 70 milliseconds of being written to Kinesis. Companies like Netflix use Data Streams to ingest and enrich terabytes of log data, enabling real-time analytics on network health.

To write data to Data Streams, one can use the Kinesis Agent for simpler cases like log files, the AWS SDK for low latency, or the Amazon Kinesis Producer Library (KPL) for optimized performance and simplified tasks. Multiple options are available for reading from Data Streams, including other Kinesis services, AWS Lambda for custom code execution, or setting up an Amazon EC2 server cluster with the Kinesis Client Library (KCL) to handle complex tasks.

Kinesis Data Streams is suitable for processing incoming data in real time or for building highly available consumer clusters for custom applications. However, if the use case only involves delivering data to supported targets in near real time, Kinesis Firehose should be considered. If you are migrating an existing Apache Kafka cluster to AWS, or need the third-party integrations available in the Apache Kafka ecosystem, Amazon MSK is recommended.
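
As an illustration of the SDK path mentioned above, a minimal boto3 producer; the stream name and event fields are hypothetical:

```python
import boto3
import json

kinesis = boto3.client("kinesis")

# Write a single event; the partition key determines which shard receives
# the record, so related events can be kept in order on one shard.
kinesis.put_record(
    StreamName="app-events",  # hypothetical stream name
    Data=json.dumps({"event": "login", "user": "u-123"}).encode("utf-8"),
    PartitionKey="u-123",
)
```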

4
Q

Amazon Kinesis Data Analytics

A

Amazon Kinesis Data Analytics is a managed service designed to analyze streaming data using SQL or Apache Flink. It supports various data sources such as clickstream data, social media data, and IoT data. One practical application of Kinesis Data Analytics is analyzing clickstream data from an e-commerce website to gain real-time insights into product sales. For instance, an organization can assess the effectiveness of a product promotion by monitoring its impact on sales. By employing simple SQL queries on the clickstream logs, Kinesis Data Analytics facilitates the rapid retrieval of information, such as the number of sales for a specific product during each 5-minute interval since the promotion started.

Kinesis Data Analytics is an ideal solution for scenarios that involve processing streaming data and deriving timely insights. Furthermore, it serves as a suitable option for migrating existing Apache Flink applications to the cloud.
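
A sketch of the kind of windowed query the card describes, in the Kinesis Data Analytics SQL dialect; the stream and column names are hypothetical, and real applications typically pump results into an in-application output stream:

```sql
-- Sales per product over tumbling 5-minute windows
SELECT STREAM
    product_id,
    COUNT(*) AS sales_count
FROM "SOURCE_SQL_STREAM_001"
WHERE event_type = 'purchase'
GROUP BY
    product_id,
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);
```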

5
Q

Amazon Kinesis Video Streams

A

Amazon Kinesis Video Streams is a service designed for processing time-bound streams of unstructured data such as video, audio, and radar data. It handles the provisioning and scaling of the compute infrastructure needed to ingest streaming video from large numbers of sources. Kinesis Video Streams allows for live and on-demand viewing of video content and integrates with other Amazon API services to enable advanced applications such as computer vision and video analytics. It simplifies the development of full-featured streaming applications for devices such as video doorbells, home security cameras, and baby monitors.

Kinesis Video Streams is ideal for creating applications that involve ingesting and processing streaming media data, facilitating live or on-demand playback with ease and scalability.
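
For the playback side, a minimal boto3 sketch that looks up an HLS session URL for live viewing; the stream name is hypothetical:

```python
import boto3

kvs = boto3.client("kinesisvideo")

# Each stream exposes a dedicated data endpoint per API operation
endpoint = kvs.get_data_endpoint(
    StreamName="front-door-camera",  # hypothetical stream name
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

media = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
url = media.get_hls_streaming_session_url(
    StreamName="front-door-camera",
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]
print(url)  # playable in any HLS-capable player
```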

6
Q

Amazon MSK

A

Apache Kafka is a popular open-source distributed event streaming platform used for creating high-performance streaming data pipelines and applications. Amazon MSK (Managed Streaming for Apache Kafka) is AWS’s managed version of Apache Kafka, which simplifies installation, scaling, updating, and management tasks. It allows organizations to deploy an Apache Kafka cluster easily through the AWS console and automates cluster health monitoring and component replacement. Amazon MSK is ideal for replacing existing Apache Kafka clusters or leveraging third-party integrations from the Apache Kafka ecosystem. However, for new projects, Amazon Kinesis may be a better choice as it is serverless and charges based on data throughput, unlike Amazon MSK, which requires paying for the cluster regardless of data usage.

The main reason to choose Amazon MSK over Amazon Kinesis is when you require full compatibility with Apache Kafka or need more control over your Kafka infrastructure.
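
Because MSK is wire-compatible with Apache Kafka, existing Kafka clients work unchanged. A minimal sketch using the standard kafka-python client; the broker address and topic are hypothetical (the real bootstrap brokers come from the MSK GetBootstrapBrokers API or the console):

```python
from kafka import KafkaProducer  # standard kafka-python client

# Hypothetical bootstrap broker address for an MSK cluster
producer = KafkaProducer(
    bootstrap_servers="b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9092"
)
producer.send("clickstream", b'{"page": "/home"}')
producer.flush()
```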

7
Q

Amazon AppFlow

A

Amazon AppFlow allows data ingestion from popular SaaS services and facilitates data transformation and writing to common analytics targets like Amazon S3, Amazon Redshift, and Snowflake. It can also write data to certain SaaS services. For instance, AppFlow can be utilized to ingest lead data from Marketo and automatically create Salesforce contact records based on new Marketo leads. From a data engineering perspective, AppFlow enables the automatic extraction of new opportunity records from Salesforce and their storage in an S3 data lake or Redshift data warehouse, allowing for advanced analytics by combining the opportunity records with other datasets. AppFlow supports configuration for scheduled or event-triggered execution, data filtering, masking, validation, and calculations on source data fields. New integrations are added over time, so consult the AWS documentation for the currently supported sources and targets.
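
Flows are usually built in the console, but an existing flow can also be run on demand from code. A minimal boto3 sketch; the flow name is hypothetical:

```python
import boto3

appflow = boto3.client("appflow")

# Trigger an existing flow (e.g., Salesforce opportunities -> S3)
# outside its normal schedule.
appflow.start_flow(flowName="salesforce-opportunities-to-s3")
```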

8
Q

AWS Transfer Family

A

AWS Transfer Family is a fully managed service that enables file transfers directly to and from Amazon S3 using common protocols such as SFTP, FTPS, and FTP. It caters to organizations that still rely on these protocols for data exchange with external partners. For example, a real estate company receiving MLS files from an SFTP provider can seamlessly migrate to the managed service. By replicating the existing setup in AWS Transfer Family, future transfers can be written directly to Amazon S3, making the data readily accessible for data transformation pipelines in an Amazon S3-based data lake. Organizations currently running their own FTP, SFTP, or FTPS servers should consider transitioning to this managed service.
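
A minimal boto3 sketch of standing up a service-managed SFTP endpoint backed by S3; users, their SSH keys, and home-directory mappings would be configured separately:

```python
import boto3

transfer = boto3.client("transfer")

# Create an SFTP server whose home directories live in Amazon S3;
# user identities are managed by the service itself.
server = transfer.create_server(
    Protocols=["SFTP"],
    Domain="S3",
    IdentityProviderType="SERVICE_MANAGED",
)
print(server["ServerId"])
```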

9
Q

AWS DataSync

A

AWS DataSync is a service that simplifies the process of ingesting data from on-premises storage systems to AWS. It supports protocols like NFS and SMB for accessing files on different systems and can replicate data from file servers and from object storage systems compatible with the Amazon S3 API. DataSync enables syncing data to various AWS targets, including Amazon S3, making it suitable for syncing end-of-day transactions or transferring historical data to an S3 data lake. It is ideal for ingesting current or historical data from on-premises storage systems to AWS over a network connection. However, for large historical datasets where network transfer is impractical, the AWS Snow Family of devices should be considered. If data needs preprocessing before landing in Amazon S3, the Amazon Kinesis Agent can be used to preprocess the data and send it to Amazon S3 via Amazon Kinesis Firehose.
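
A minimal boto3 sketch of wiring a task between two pre-created locations (an NFS share and an S3 bucket) and kicking off a run; the location ARNs are hypothetical placeholders:

```python
import boto3

datasync = boto3.client("datasync")

# Locations are created separately; these ARNs are hypothetical.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-nfs1",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-s3a",
    Name="nightly-transactions-sync",
)
datasync.start_task_execution(TaskArn=task["TaskArn"])
```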

10
Q

AWS Snow Family

A

The AWS Snow family consists of ruggedized devices that offer an offline data transfer solution for cases where transferring large datasets to AWS over a network connection is impractical. These devices can be shipped to a location with limited connectivity or large datasets, allowing data to be transferred over a local network. Once the data is loaded onto the device, it is sent back to AWS for transfer to Amazon S3. The Snow devices provide data encryption at rest and some models even offer compute capabilities for edge computing use cases. This solution is particularly useful when there are challenges with network connectivity or the size of the dataset, providing a reliable and secure method to transfer data to AWS.

11
Q

AWS Snowcone

A

Lightweight (4.5 lb/2.1 kg) device with 8 TB of usable storage.

AWS Snowcone is a small, rugged, and portable device offered by Amazon Web Services (AWS) that enables offline data transfer and edge computing capabilities. It is designed for use cases where there are limited network connectivity or harsh environments. Snowcone can securely transport up to 8 terabytes of data, supports data encryption, and can be easily transported to remote locations. It allows organizations to collect, process, and store data in challenging environments and later transfer the data to AWS for further processing and analysis.

12
Q

AWS Snowball

A

Medium-weight (49.7 lb/22.5 kg) device with 80 TB of usable storage.

AWS Snowball is a data transfer service provided by Amazon Web Services (AWS) that facilitates the physical transfer of large amounts of data. It involves a rugged, portable device called Snowball, which can store up to 80 terabytes of data. The device is shipped to the customer’s location, where data can be easily loaded onto it. Once filled, the Snowball is returned to AWS, where the data is securely transferred to the desired AWS storage service. Snowball offers a reliable and efficient solution for transferring large datasets offline, particularly in scenarios where network transfer is impractical or time-consuming.

13
Q

AWS Snowmobile

A

AWS Snowmobile is a massive data transfer service provided by Amazon Web Services (AWS) that enables the physical transfer of exabytes of data. It involves a 45-foot-long ruggedized shipping container equipped with high-security measures and a massive storage capacity. The Snowmobile is transported to the customer’s location, where petabytes or exabytes of data can be loaded onto it. Once filled, the Snowmobile is transported back to an AWS data center for secure data transfer to the desired AWS storage service. Snowmobile offers a reliable and efficient solution for transferring extremely large datasets, providing a scalable option for organizations dealing with massive data migration challenges.

14
Q

AWS Lambda

A

AWS Lambda is a serverless compute service provided by Amazon Web Services (AWS) that allows you to run your code without managing servers. With Lambda, you can upload your code and create functions that are automatically triggered in response to events or API calls. Lambda scales automatically to handle the incoming workload and charges you only for the compute time consumed by your code. It offers seamless integration with other AWS services, making it easy to build scalable and event-driven applications with reduced operational overhead. Lambda is suitable for a wide range of use cases, including data processing, real-time file processing, microservices, and more.
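
A minimal sketch of the Python handler model: the function is invoked once per event, and you pay only while it runs. The event shape shown assumes a record-based trigger such as S3 or Kinesis:

```python
# Minimal Lambda handler: invoked per event, no servers to manage.
def lambda_handler(event, context):
    # Record-based triggers (S3, Kinesis, SQS) deliver a list of records.
    records = event.get("Records", [])
    return {"processed": len(records)}
```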

15
Q

AWS Glue

A

AWS Glue is a fully managed extract, transform, and load (ETL) service offered by Amazon Web Services (AWS). It provides a set of sub-services that work together to simplify the process of data preparation, data cataloging, and ETL workflows. With Glue, users can easily discover, catalog, and transform data from various sources using automated crawlers and a centralized metadata repository called the Glue Data Catalog. Glue also offers visual data preparation tools, ETL capabilities, and a visual interface for designing and monitoring workflows. It provides a scalable and serverless solution for efficient data integration and transformation tasks, enabling organizations to derive insights from their data more effectively.

16
Q

AWS Glue Data Catalog

A

This is a centralized metadata repository that stores metadata about various data sources, tables, and schemas. It provides a unified view of the data across different services and helps in cataloging and discovering data.
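
Catalog metadata is queryable through the Glue API. A small boto3 sketch; the database and table names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Read the schema that a crawler (or a user) registered in the catalog
table = glue.get_table(DatabaseName="sales_db", Name="orders")
for column in table["Table"]["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])
```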

17
Q

AWS Glue Crawlers

A

Crawlers automatically discover and infer the schema of data stored in various sources such as databases, data lakes, and data warehouses. They create metadata in the Glue Data Catalog, making it easier to work with the data.
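
Crawlers can run on a schedule or on demand. A minimal boto3 sketch; the crawler name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Re-crawl the source so new partitions and columns appear in the Data Catalog
glue.start_crawler(Name="s3-sales-data-crawler")
state = glue.get_crawler(Name="s3-sales-data-crawler")["Crawler"]["State"]
print(state)  # e.g., RUNNING
```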

18
Q

AWS Glue DataBrew

A

DataBrew is a visual data preparation tool that helps in cleaning, normalizing, and transforming data. It offers a no-code interface to build data transformation recipes and can be used to prepare data for analytics and machine learning.

19
Q

AWS Glue ETL

A

The ETL (Extract, Transform, Load) capabilities of AWS Glue allow users to define and run workflows for data extraction, transformation, and loading into target data stores. It provides an easy way to create and manage ETL jobs without managing the underlying infrastructure.
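
A minimal PySpark Glue job sketch: read a cataloged table and write it back to S3 as Parquet. The database, table, and bucket names are hypothetical:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a table registered in the Glue Data Catalog
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Write it out as Parquet for downstream analytics
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)
```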

20
Q

AWS Glue Studio

A

Glue Studio is a visual interface for building, running, and monitoring Glue ETL jobs. It offers a code-free experience for designing data transformation workflows using a drag-and-drop interface.

21
Q

Amazon EMR

A