Practice Questions - Amazon AWS Certified Big Data - Specialty Flashcards

(85 cards)

1
Q

A company’s social media manager requests more staff on the weekends to handle an increase in customer contacts from a particular region. The company needs a report to visualize the trends on weekends over the past 6 months using QuickSight. How should the data be represented?

A. A line graph plotting customer contacts vs. time, with a line for each region
B. A pie chart per region plotting customer contacts per day of the week
C. A map of regions with a heatmap overlay to show the volume of customer contacts
D. A bar graph plotting region vs. volume of social media contacts

A

A

2
Q

A web-hosting company is building a web analytics tool to capture clickstream data from all of the websites hosted within its platform and to provide near-real-time business intelligence. This entire system is built on AWS services. The web-hosting company is interested in using Amazon Kinesis to collect this data and perform sliding window analytics. What is the most reliable and fault-tolerant technique to get each website to send data to Amazon Kinesis with every click?

A. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the sessionID as a partition key and set up a loop to retry until a success response is received.
B. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis Producer Library .addRecords method.
C. Each web server buffers the requests until the count reaches 500 and sends them to Amazon Kinesis using the Amazon Kinesis PutRecord API.
D. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the exponential back-off algorithm for retries until a successful response is received.

A

B

The most reliable and fault-tolerant technique is to use the Amazon Kinesis Producer Library (KPL). Option B correctly identifies this. The KPL provides features like batching, retry mechanisms, rate limiting, and aggregation, all crucial for handling the high volume and potential errors inherent in clickstream data. Option A uses PutRecord, which lacks the built-in features of KPL for efficient and reliable delivery. Option C introduces unnecessary buffering that can delay near-real-time analytics. While option D improves on A with exponential backoff, it still lacks the advanced features of KPL. Therefore, B is the superior choice for reliability and fault tolerance in a near real-time system.
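
As a concrete illustration of the retry pattern option D describes, here is a minimal Python (boto3) sketch; the KPL itself is a separate Java library that layers batching, aggregation, and rate limiting on top of this. The stream name and retry settings are assumptions for the example.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")


def put_click_event(payload: bytes, session_id: str, max_attempts: int = 5) -> dict:
    """Send one clickstream record with exponential back-off plus jitter."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="clickstream",   # hypothetical stream name
                Data=payload,
                PartitionKey=session_id,    # spreads sessions across shards
            )
        except ClientError:
            # Back off 0.1 s, 0.2 s, 0.4 s, ... plus jitter before retrying.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)
    raise RuntimeError("record could not be delivered after retries")
```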

3
Q

A company hosts a portfolio of e-commerce websites across the Oregon, N. Virginia, Ireland, and Sydney AWS regions. Each site keeps log files that capture user behavior. The company has built an application that generates batches of product recommendations with collaborative filtering in Oregon. Oregon was selected because the flagship site is hosted there and provides the largest collection of data to train machine learning models against. The other regions DO NOT have enough historic data to train accurate machine learning models. Which set of data processing steps improves recommendations for each region?
A. Use the e-commerce application in Oregon to write replica log files in each other region.
B. Use Amazon S3 bucket replication to consolidate log entries and build a single model in Oregon.
C. Use Kinesis as a buffer for web logs and replicate logs to the Kinesis stream of a neighboring region.
D. Use the CloudWatch Logs agent to consolidate logs into a single CloudWatch Logs group.

A

A

The best solution is A. The problem states that other regions lack sufficient data to train their own models. Replicating the Oregon log files to other regions (A) provides them with the data necessary to build and improve their regional recommendation models.

Option B is incorrect because consolidating all logs into a single model in Oregon doesn’t improve regional recommendations; it creates a single, potentially less accurate model for all regions. Option C is not the most efficient approach as it only replicates logs to neighboring regions, not all regions. Option D is incorrect based on the discussion; cross-region CloudWatch Logs consolidation was not available at the time the question was created.

4
Q

An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
A. When the tables are highly denormalized and do NOT participate in frequent joins.
B. When data must be grouped based on a specific key on a defined slice.
C. When data transfer between nodes must be eliminated.
D. When a new table has been loaded and it is unclear how it will be joined to dimensions.

A

A, D

The correct answers are A and D. EVEN distribution in Redshift distributes rows across slices in a round-robin fashion, regardless of column values. This is ideal in two scenarios:

A. When tables are highly denormalized and do not participate in frequent joins: Since data doesn’t need to be grouped based on a specific key for efficient joins, EVEN distribution avoids unnecessary data movement.

D. When a new table is loaded and its join behavior is unclear: Using EVEN distribution provides a neutral starting point until the table’s join patterns are better understood and a more optimized distribution style (KEY or ALL) can be chosen.

Option B is incorrect because it describes KEY distribution, where data is grouped based on a specific key. Option C is incorrect because eliminating data transfer between nodes is the goal of KEY distribution, not EVEN distribution.
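
For illustration, a minimal sketch of creating a table with EVEN distribution, submitted through the Redshift Data API from Python; the cluster identifier, database, user, and table definition are assumptions for the example.

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")  # assumed region

# DISTSTYLE EVEN spreads rows round-robin across slices: a reasonable default
# for a denormalized table or one whose join pattern is not yet known.
ddl = """
CREATE TABLE staging_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    total       DECIMAL(12,2)
)
DISTSTYLE EVEN;
"""

rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",   # hypothetical cluster
    Database="dev",
    DbUser="admin",
    Sql=ddl,
)
```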

5
Q

An organization needs a data store to handle the following data types and access patterns:
✑ Faceting
✑ Search
✑ Flexible schema (JSON) and fixed schema
✑ Noise word elimination
Which data store should the organization choose?
A. Amazon Relational Database Service (RDS)
B. Amazon Redshift
C. Amazon DynamoDB
D. Amazon Elasticsearch Service

A

D. Amazon Elasticsearch Service

The correct answer is D because Amazon Elasticsearch Service (Amazon ES) is a fully managed service that offers all the features required by the organization. It supports faceting and search natively. It can handle both flexible (JSON) and fixed schemas. Finally, noise word elimination is a common feature in search engines like Amazon ES.

Option A (Amazon RDS) is primarily a relational database, not ideal for flexible schemas or faceting/search functionalities. Option B (Amazon Redshift) is a data warehouse optimized for analytical queries, not real-time search and faceting. Option C (Amazon DynamoDB) is a NoSQL key-value and document database; while it can handle flexible schemas, it is not designed for complex search and faceting operations.

6
Q

An Amazon Redshift Database is encrypted using KMS. A data engineer needs to use the AWS CLI to create a KMS encrypted snapshot of the database in another AWS region. Which three steps should the data engineer take to accomplish this task? (Choose three.)
A. Create a new KMS key in the destination region.
B. Copy the existing KMS key to the destination region.
C. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key from the source region.
D. In the source region, enable cross-region replication and specify the name of the copy grant created.
E. In the destination region, enable cross-region replication and specify the name of the copy grant created.

A

A, C, D

The correct answer is A, C, and D. KMS keys are region-specific and cannot be copied between regions, so a new key must be created in the destination region (A). Amazon Redshift also requires a snapshot copy grant, created with CreateSnapshotCopyGrant, that authorizes it to use a KMS key to encrypt the copied snapshots (C). Finally, cross-region snapshot copy is enabled in the source region, specifying the destination region and the name of the copy grant (D). Option B is incorrect because a KMS key cannot be copied to another region. Option E is incorrect because cross-region copy is configured in the source region, not the destination.
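
Although the question mentions the AWS CLI, the same three steps can be sketched with boto3; the region names, grant name, and cluster identifier below are placeholders, not values from the question.

```python
import boto3

SOURCE_REGION = "us-east-1"   # assumed source region
DEST_REGION = "us-west-2"     # assumed destination region

# Step A: create a KMS key in the destination region (keys cannot be copied).
dest_kms = boto3.client("kms", region_name=DEST_REGION)
key_id = dest_kms.create_key(Description="Redshift snapshot copies")["KeyMetadata"]["KeyId"]

# Step C: create a snapshot copy grant so Redshift can encrypt copied
# snapshots with that key in the destination region.
dest_redshift = boto3.client("redshift", region_name=DEST_REGION)
dest_redshift.create_snapshot_copy_grant(
    SnapshotCopyGrantName="dwh-copy-grant",   # hypothetical grant name
    KmsKeyId=key_id,
)

# Step D: in the source region, enable cross-region snapshot copy and
# reference the grant by name.
src_redshift = boto3.client("redshift", region_name=SOURCE_REGION)
src_redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",    # hypothetical cluster name
    DestinationRegion=DEST_REGION,
    SnapshotCopyGrantName="dwh-copy-grant",
    RetentionPeriod=7,
)
```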

7
Q

An Amazon EMR cluster using EMRFS has access to petabytes of data on Amazon S3, originating from multiple unique data sources. The customer needs to query common fields across some of the data sets to be able to perform interactive joins and then display results quickly. Which technology is most appropriate to enable this capability?
A. Presto
B. MicroStrategy
C. Pig
D. R Studio

A

A. Presto

Presto is the most appropriate technology because it is designed for interactive querying and fast joins on large datasets. The question specifically mentions the need for “interactive joins and then display results quickly,” which aligns perfectly with Presto’s capabilities. Pig is better suited for batch processing, MicroStrategy is a business intelligence tool, and R Studio is primarily for statistical computing and data analysis, making them less suitable for this specific use case requiring fast interactive querying of petabytes of data.

8
Q

A large oil and gas company needs to provide near real-time alerts when peak thresholds are exceeded in its pipeline system. The company has developed a system to capture pipeline metrics such as flow rate, pressure, and temperature using millions of sensors. The sensors deliver to AWS IoT. What is a cost-effective way to provide near real-time alerts on the pipeline metrics?
A. Create an AWS IoT rule to generate an Amazon SNS notification.
B. Store the data points in an Amazon DynamoDB table and poll it for peak metrics data from an Amazon EC2 application.
C. Create an Amazon Machine Learning model and invoke it with AWS Lambda.
D. Use Amazon Kinesis Streams and a KCL-based application deployed on AWS Elastic Beanstalk.

A

A

9
Q

A system engineer for a company proposes digitalization and backup of large archives for customers. The systems engineer needs to provide users with a secure storage that makes sure that data will never be tampered with once it has been uploaded. How should this be accomplished?

A. Create an Amazon Glacier Vault. Specify a “Deny” Vault Lock policy on this Vault to block “glacier:DeleteArchive”.
B. Create an Amazon S3 bucket. Specify a “Deny” bucket policy on this bucket to block “s3:DeleteObject”.
C. Create an Amazon Glacier Vault. Specify a “Deny” vault access policy on this Vault to block “glacier:DeleteArchive”.
D. Create secondary AWS Account containing an Amazon S3 bucket. Grant “s3:PutObject” to the primary account.

A

A

The correct answer is A because a Vault Lock policy, unlike a Vault Access policy, cannot be modified after it’s locked. This ensures that the “glacier:DeleteArchive” action is permanently blocked, preventing any tampering with the data after upload. Option B is incorrect because S3 bucket policies, even with a “Deny” setting, can be changed. Option C is incorrect because a vault access policy can be modified, allowing for potential future tampering. Option D is incorrect as it doesn’t inherently prevent data tampering after upload; it only controls who can upload data.
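
A rough boto3 sketch of the Vault Lock workflow follows; the vault name, account ID, and policy wording are illustrative assumptions. The two-step initiate/complete flow reflects the 24-hour window during which the lock can still be aborted before it becomes immutable.

```python
import json

import boto3

glacier = boto3.client("glacier", region_name="us-east-1")  # assumed region
VAULT = "customer-archives"                                 # hypothetical vault name

glacier.create_vault(accountId="-", vaultName=VAULT)

# Vault Lock policy that denies archive deletion; once the lock is completed,
# the policy itself can no longer be changed or removed.
lock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-archive-deletion",
            "Principal": "*",
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            # placeholder account ID
            "Resource": f"arn:aws:glacier:us-east-1:111122223333:vaults/{VAULT}",
        }
    ],
}

lock_id = glacier.initiate_vault_lock(
    accountId="-",
    vaultName=VAULT,
    policy={"Policy": json.dumps(lock_policy)},
)["lockId"]

glacier.complete_vault_lock(accountId="-", vaultName=VAULT, lockId=lock_id)
```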

10
Q

A clinical trial will rely on medical sensors to remotely assess patient health. Each participating physician requires visual reports each morning. These reports are built from aggregations of all sensor data collected each minute. What is the most cost-effective solution for creating this daily visualization?

A. Use Kinesis Aggregators Library to generate reports for reviewing the patient sensor data and generate a QuickSight visualization on the new data each morning for the physician to review.
B. Use a transient EMR cluster that shuts down after use to aggregate the patient sensor data each night and generate a QuickSight visualization on the new data each morning for the physician to review.
C. Use Spark streaming on EMR to aggregate the patient sensor data in every 15 minutes and generate a QuickSight visualization on the new data each morning for the physician to review.
D. Use an EMR cluster to aggregate the patient sensor data each night and provide Zeppelin notebooks that look at the new data residing on the cluster each morning for the physician to review.

A

B

The most cost-effective solution is B. A transient EMR cluster consumes resources only during the nightly aggregation and shuts down afterward, minimizing cost compared to the always-on alternatives. QuickSight is a cost-effective visualization tool suitable for generating daily reports for multiple physicians. Option A could become expensive if the Kinesis data volume is large. Options C and D keep resources running continuously (a Spark Streaming job or a persistent EMR cluster), making them more expensive than a transient cluster. Option D also relies on Zeppelin notebooks, which are less efficient than QuickSight for delivering visualizations to multiple users.
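
As a sketch of what "transient" means here, the cluster can be launched with auto-termination so it disappears as soon as the nightly step finishes; the instance types, release label, and job script path below are assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

emr.run_job_flow(
    Name="nightly-sensor-aggregation",        # hypothetical job name
    ReleaseLabel="emr-5.36.0",                # assumed release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient behaviour: terminate once all steps have completed.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "aggregate-sensor-data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/aggregate.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```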

11
Q

The department of transportation for a major metropolitan area has placed sensors on roads at key locations around the city. The goal is to analyze the flow of traffic and notifications from emergency services to identify potential issues and to help planners respond to issues within 30 seconds of their occurrence. Which solution should a data engineer choose to create a scalable and fault-tolerant solution that meets this requirement?

A. Collect the sensor data with Amazon Kinesis Firehose and store it in Amazon Redshift for analysis. Collect emergency services events with Amazon SQS and store in Amazon DynamoDB for analysis.
B. Collect the sensor data with Amazon SQS and store in Amazon DynamoDB for analysis. Collect emergency services events with Amazon Kinesis Firehose and store in Amazon Redshift for analysis.
C. Collect both sensor data and emergency services events with Amazon Kinesis Streams and use DynamoDB for analysis.
D. Collect both sensor data and emergency services events with Amazon Kinesis Firehose and use Amazon Redshift for analysis.

A

A

12
Q

An administrator needs to design a distribution strategy for a star schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which three circumstances would choosing Key-based distribution be most appropriate? (Select three.)
A. When the administrator needs to optimize a large, slowly changing dimension table.
B. When the administrator needs to reduce cross-node traffic.
C. When the administrator needs to optimize the fact table for parity with the number of slices.
D. When the administrator needs to balance data distribution and collocation data.
E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates.

A

B, D, E

The discussion reveals conflicting answers, but the consensus points to B, D, and E as the most appropriate choices for key-based distribution.

  • B. When the administrator needs to reduce cross-node traffic: Key-based distribution minimizes data movement across nodes during joins by ensuring related data resides on the same node. This is a primary benefit of this distribution style.
  • D. When the administrator needs to balance data distribution and collocation data: Key-based distribution helps balance data distribution while ensuring data needed for joins is co-located.
  • E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates: This directly relates to the reduced cross-node traffic mentioned in B, making this a key advantage of key-based distribution.

Option A is incorrect because slowly changing dimension tables are usually candidates for ALL distribution (which replicates the table to every node) rather than KEY distribution. Option C is incorrect because parity with the number of slices is generally associated with EVEN distribution, not key-based distribution.

13
Q

A company receives data sets coming from external providers on Amazon S3. Data sets from different providers are dependent on one another. Data sets will arrive at different times and in no particular order. A data architect needs to design a solution that enables the company to do the following:
✑ Rapidly perform cross data set analysis as soon as the data becomes available
✑ Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets these requirements?
A. Maintain data dependency information in Amazon RDS for MySQL. Use an AWS Data Pipeline job to load an Amazon EMR Hive table based on task dependencies and event notification triggers in Amazon S3.
B. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon SNS and event notifications to publish data to a fleet of Amazon EC2 workers. Once the task dependencies have been resolved, process the data with Amazon EMR.
C. Maintain data dependency information in an Amazon ElastiCache Redis cluster. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to Redis. Once the task dependencies have been resolved, process the data with Amazon EMR.
D. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to the task associated with it in DynamoDB. Once all task dependencies have been resolved, process the data with Amazon EMR.

A

D

The discussion overwhelmingly supports option D as the correct answer. The reasoning centers around scalability and cost-effectiveness. DynamoDB is well-suited for storing the configuration data representing task dependencies, offering better scalability than Redis (option C) and RDS (option A). Option B, while using DynamoDB, employs SNS and EC2, which is less efficient and potentially more costly than using Lambda functions (option D) for triggering data processing. The consensus points to option D as the most scalable and cost-effective approach using readily available AWS services.
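
A simplified sketch of the Lambda function that option D implies is shown below. The table layout (one item per task listing its required and arrived data sets), the key derivation from the S3 object key, and the EMR step details are assumptions made for illustration, not a prescribed schema.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")
table = dynamodb.Table("dataset-dependencies")  # hypothetical table name


def handler(event, context):
    """Triggered by an S3 event notification for each arriving data set."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        task_id = key.split("/")[0]              # assumed key layout: task/provider/file

        # Record that this data set has arrived on its task item.
        item = table.update_item(
            Key={"task_id": task_id},
            UpdateExpression="ADD arrived :ds",
            ExpressionAttributeValues={":ds": {key}},
            ReturnValues="ALL_NEW",
        )["Attributes"]

        # When every declared dependency is present, start EMR processing.
        if set(item.get("required", [])) <= set(item.get("arrived", [])):
            emr.add_job_flow_steps(
                JobFlowId=item["cluster_id"],     # hypothetical attributes on the item
                Steps=[{
                    "Name": f"process-{task_id}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", item["job_script"]],
                    },
                }],
            )
```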

14
Q

A game company uses DynamoDB to support its game application. Amazon Redshift stores the past two years of historical data. Game traffic fluctuates throughout the year due to factors like seasons, movie releases, and holidays. An administrator needs to predict the required DynamoDB read and write throughput (RCU and WCU) for each week in advance. Which approach should the administrator use?

A. Feed the data into Amazon Machine Learning and build a regression model.
B. Feed the data into Spark MLlib and build a random forest model.
C. Feed the data into Apache Mahout and build a multi-classification model.
D. Feed the data into Amazon Machine Learning and build a binary classification model.

A

A

The best approach is A because Redshift contains two years of historical data, which can be used as labeled data to train a regression model. A regression model is suitable for predicting a continuous value, such as the required RCU and WCU. Options B and C are less suitable because they are focused on classification problems, not regression. Option D is also less suitable because a binary classification model would only predict whether the throughput is above or below a certain threshold, not the precise amount needed.

15
Q

A customer is collecting clickstream data using Amazon Kinesis and is grouping the events by IP address into 5-minute chunks stored in Amazon S3. Many analysts in the company use Hive on Amazon EMR to analyze this data. Their queries always reference a single IP address. Data must be optimized for querying based on IP address using Hive running on Amazon EMR. What is the most efficient method to query the data with Hive?
A. Store an index of the files by IP address in the Amazon DynamoDB metadata store for EMRFS.
B. Store the Amazon S3 objects with the following naming scheme: bucket_name/source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
C. Store the data in an HBase table with the IP address as the row key.
D. Store the events for an IP address as a single file in Amazon S3 and add metadata with keys: Hive_Partitioned_IPAddress.

A

B

The most efficient method is to leverage Hive’s partitioning capabilities in S3. Option B uses a naming scheme that creates partitions based on the IP address and date/time. This allows Hive to quickly locate the relevant data based on the IP address in the query, avoiding a full table scan.

Option A is incorrect because while DynamoDB can store metadata, it’s not directly integrated with Hive’s partitioning functionality in the same way that S3 is. Option C is incorrect because HBase is a NoSQL database, not optimized for Hive queries. Option D is incorrect because while metadata is helpful, it doesn’t improve query efficiency as much as partitioning the data in S3 directly via the file naming convention. Combining all events for a single IP address into one file doesn’t improve performance because Hive would still need to scan that single (potentially very large) file.
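
As an illustration of option B, the producer only has to write each 5-minute chunk under a partition-style prefix; a Hive external table declared with matching partition columns (source, year, month, day, hour) can then prune everything except the requested IP address. The bucket name and key layout below are assumptions.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "clickstream-archive"   # hypothetical bucket name


def write_chunk(ip_address: str, payload: bytes) -> str:
    """Write one 5-minute chunk under a Hive-style partitioned prefix."""
    now = datetime.now(timezone.utc)
    key = (
        f"source={ip_address}/year={now:%Y}/month={now:%m}/"
        f"day={now:%d}/hour={now:%H}/events-{now:%M}.json"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)
    return key
```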

16
Q

A system needs to collect on-premises application spool files into a persistent storage layer in AWS. Each spool file is 2 KB. The application generates 1 million files per hour. Each source file is automatically deleted from the local server after an hour. What is the most cost-efficient option to meet these requirements?
A. Write file contents to an Amazon DynamoDB table.
B. Copy files to Amazon S3 Standard Storage.
C. Write file contents to Amazon ElastiCache.
D. Copy files to Amazon S3 Infrequent Access storage.

A

A

17
Q

A company that manufactures and sells smart air conditioning units also offers add-on services so that customers can see real-time dashboards in a mobile application or a web browser. Each unit sends its sensor information in JSON format every two seconds for processing and analysis. The company also needs to consume this data to predict possible equipment problems before they occur. A few thousand pre-purchased units will be delivered in the next couple of months. The company expects high market growth in the next year and needs to handle a massive amount of data and scale without interruption. Which ingestion solution should the company use?

A. Write sensor data records to Amazon Kinesis Streams. Process the data using KCL applications for the end-consumer dashboard and anomaly detection workflows.
B. Batch sensor data to Amazon Simple Storage Service (S3) every 15 minutes. Flow the data downstream to the end-consumer dashboard and to the anomaly detection application.
C. Write sensor data records to Amazon Kinesis Firehose with Amazon Simple Storage Service (S3) as the destination. Consume the data with a KCL application for the end-consumer dashboard and anomaly detection.
D. Write sensor data records to Amazon Relational Database Service (RDS). Build both the end-consumer dashboard and anomaly detection application on top of Amazon RDS.

A

A

The discussion highlights that while option A (using Kinesis Streams with KCL) is the best fit for real-time processing and scaling, some concerns were raised about resharding causing interruptions. However, options B, C, and D are clearly inferior. B is unsuitable for real-time dashboards due to the 15-minute batching. C incorrectly states that KCL can be used with Kinesis Firehose. D is unsuitable because RDS is not designed for the volume and velocity of data involved, and it does not scale efficiently for this use case. Therefore, while acknowledging the resharding limitations of A, it remains the most suitable option among the choices presented. The question emphasizes real-time processing and massive scaling, making A the closest fit despite the potential resharding challenges.

18
Q

A media advertising company handles a large number of real-time messages sourced from over 200 websites. The company’s data engineer needs to collect and process records in real time for analysis using Spark Streaming on Amazon Elastic MapReduce (EMR). The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority. Which Amazon Kinesis configuration meets these requirements?

A. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Pull messages off Firehose with Spark Streaming in parallel to persistence to Amazon S3.

B. Publish messages to Amazon Kinesis Streams. Pull messages off Streams with Spark Streaming in parallel to AWS Lambda pushing messages from Streams to Firehose backed by Amazon Simple Storage Service (S3).

C. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Use AWS Lambda to pull messages from Firehose to Streams for processing with Spark Streaming.

D. Publish messages to Amazon Kinesis Streams, pull messages off with Spark Streaming, and write raw data to Amazon Simple Storage Service (S3) before and after processing.

A

D

The discussion highlights that option D is the most suitable because it allows Spark Streaming to directly read from Kinesis Streams, fulfilling the real-time processing requirement. The raw data is written to S3 both before and after processing, ensuring that all raw messages are preserved as mandated.

Option A is incorrect because Spark Streaming cannot directly read from Kinesis Firehose. Option B is inefficient and adds unnecessary complexity by using Lambda as an intermediary. Option C is also incorrect because it introduces a delay by routing data through Firehose and Lambda before reaching Spark Streaming, violating the real-time requirement and potentially leading to data loss if the process isn’t perfectly reliable.

19
Q

A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage. Which AWS service strategy is best for this use case?

A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.

A

B

The best option is B because Amazon EMR (Elastic MapReduce) is designed for processing massive datasets like the 5 PB of email data described. EMR allows for parallel processing using Hadoop or Spark, which are ideal for distributing the computational load across a cluster. This efficiently handles the scale of the data and the computational demands of the text analysis algorithm.

Option A is incorrect because ElastiCache is an in-memory data store, not designed for processing 5 PB of data. Option C is incorrect because Elasticsearch, while useful for indexing and searching, is not ideal for processing this scale of data; it has storage limitations. Option D is incorrect because while Data Pipeline can manage jobs, it doesn’t inherently provide the parallel processing capabilities necessary for efficiently handling such a large dataset. Directly processing the data in S3 using a single Python job would be extremely slow and inefficient.

20
Q

An administrator needs to design the event log storage architecture for events from mobile devices. The event data will be processed by an Amazon EMR cluster daily for aggregated reporting and analytics before being archived. How should the administrator recommend storing the log data?
A. Create an Amazon S3 bucket and write log data into folders by device. Execute the EMR job on the device folders.
B. Create an Amazon DynamoDB table partitioned on the device and sorted on date, write log data to table. Execute the EMR job on the Amazon DynamoDB table.
C. Create an Amazon S3 bucket and write data into folders by day. Execute the EMR job on the daily folder.
D. Create an Amazon DynamoDB table partitioned on EventID, write log data to table. Execute the EMR job on the table.

A

C

The best approach is to store the log data in an Amazon S3 bucket, organized into folders by day (C). This aligns with the daily processing requirement of the EMR cluster. EMR is efficient at processing many objects from a single S3 folder in parallel, making daily aggregation straightforward. Options A, B, and D are less efficient: A is less efficient because processing by device would require many separate EMR jobs; B and D use DynamoDB, which is designed for low-latency reads and writes, not batch processing like this scenario requires, and incurs higher costs compared to S3. The discussion highlights that partitioning in S3 is not necessary given EMR’s ability to handle parallel processing of many objects.

21
Q

There are thousands of text files on Amazon S3. The total size of the files is 1 PB. The files contain retail order information for the past 2 years. A data engineer needs to run multiple interactive queries to manipulate this data. The Data Engineer has AWS access to spin up an Amazon EMR cluster. The data engineer needs to use an application on the cluster to process this data and return the results in an interactive timeframe. Which application on the cluster should the data engineer use?

A. Oozie
B. Apache Pig with Tachyon
C. Apache Hive
D. Presto

A

D

The best answer is Presto because the question emphasizes the need for interactive query processing. Presto is designed for fast query execution and low latency, making it ideal for interactive analysis of large datasets.

While Apache Hive is suitable for large-scale batch processing, it’s not optimized for interactive queries. Oozie is a workflow scheduler, not a query engine. Apache Pig with Tachyon is better suited for ETL (Extract, Transform, Load) processes and data manipulation rather than interactive querying, although Tachyon’s in-memory capabilities could improve performance. The primary requirement here is interactive querying, which makes Presto the most appropriate choice.

22
Q

Company A operates in Country X and maintains a large dataset of historical purchase orders containing personal data (full names and telephone numbers) in five 1TB text files. This data is stored on-premises due to legal requirements. The R&D department needs to run a clustering algorithm using Amazon EMR in the closest AWS region, with a minimum latency of 200ms between the on-premises system and the AWS region. Which option allows Company A to perform clustering in the AWS Cloud while complying with the legal requirement of keeping personal data within Country X?

A. Anonymize the personal data, transfer the files to Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
B. Establish a Direct Connect link between the on-premises system and the AWS region to reduce latency, and have the EMR cluster read the data directly from the on-premises storage system over Direct Connect.
C. Encrypt the data files according to Country X’s encryption standards, store them in Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
D. Use AWS Import/Export Snowball to transfer the data to the AWS region, copy the files to an EBS volume, and have the EMR cluster read the data using EMRFS.

A

B

The correct answer is B because it is the only option that keeps the data on-premises, thereby fulfilling the legal requirement of storing personal data in Country X. Options A, C, and D all involve transferring the data to an AWS region, violating the legal requirement. While options C and D involve encryption, encryption alone does not satisfy the requirement of keeping the data within Country X. Option A’s anonymization also doesn’t address the legal requirement; the act of anonymization itself may be a violation. Option B directly addresses the problem by utilizing Direct Connect to access the data without moving it.

23
Q

A customer has an Amazon S3 bucket. Objects are uploaded simultaneously by a cluster of servers from multiple streams of data. The customer maintains a catalog of objects uploaded in Amazon S3 using an Amazon DynamoDB table. This catalog has the following fields: StreamName, TimeStamp, and ServerName, from which ObjectName can be obtained. The customer needs to define the catalog to support querying for a given stream or server within a defined time range. Which DynamoDB table scheme is most efficient to support these queries?

A. Define a Primary Key with ServerName as Partition Key and TimeStamp as Sort Key. Do NOT define a Local Secondary Index or Global Secondary Index.
B. Define a Primary Key with StreamName as Partition Key and TimeStamp followed by ServerName as Sort Key. Define a Global Secondary Index with ServerName as partition key and TimeStamp followed by StreamName.
C. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with StreamName as Partition Key. Define a Global Secondary Index with TimeStamp as Partition Key.
D. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with TimeStamp as Partition Key. Define a Global Secondary Index with StreamName as Partition Key and TimeStamp as Sort Key.

A

B

The most efficient scheme is B. The primary key (StreamName as the partition key, with a composite sort key that begins with TimeStamp) lets a single Query retrieve all objects for a given stream within a time range, and the global secondary index (ServerName as the partition key, with a TimeStamp-led sort key) serves the equivalent query for a given server. Option A supports only the server-based query; finding the objects for a given stream would require a full table scan. Options C and D are invalid because a Local Secondary Index must share the table's partition key, and neither defines a sort key on the base table that supports time-range queries.
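
A short boto3 sketch of the two query patterns that scheme B enables follows; the table name, index name, and composite attribute names are placeholders, since the question does not name them.

```python
import boto3
from boto3.dynamodb.conditions import Key

catalog = boto3.resource("dynamodb").Table("s3-object-catalog")  # hypothetical table

# Objects for a given stream in a time window: served by the base table, whose
# composite sort key starts with the timestamp, so the range condition works.
by_stream = catalog.query(
    KeyConditionExpression=Key("StreamName").eq("orders")
    & Key("TimeStampServerName").between("2017-01-01T00:00", "2017-02-01T00:00"),
)["Items"]

# Objects for a given server in the same window: served by the GSI keyed on ServerName.
by_server = catalog.query(
    IndexName="ServerName-TimeStamp-index",   # assumed GSI name
    KeyConditionExpression=Key("ServerName").eq("web-042")
    & Key("TimeStampStreamName").between("2017-01-01T00:00", "2017-02-01T00:00"),
)["Items"]
```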

24
Q

A company has several teams of analysts. Each team of analysts has their own cluster. The teams need to run SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company needs to enable a centralized metadata layer to expose the Amazon S3 objects as tables to the analysts. Which approach meets the requirement for a centralized metadata layer?
A. EMRFS consistent view with a common Amazon DynamoDB table
B. Bootstrap action to change the Hive Metastore to an Amazon RDS database
C. s3distcp with the outputManifest option to generate RDS DDL
D. Naming scheme support with automatic partition discovery from Amazon S3

A

B

The correct answer is B because it directly addresses the need for a centralized metadata layer. A centralized Hive metastore in Amazon RDS allows all analysts, regardless of their cluster, to access the same metadata, effectively exposing S3 objects as tables.

Option A is incorrect because EMRFS consistent view focuses on data consistency, not on providing a centralized metadata layer for querying S3 objects as tables. Option C is incorrect because s3distcp is a data transfer tool, not a metadata management solution. Option D relies on naming conventions and automatic partition discovery, which is insufficient for a truly centralized and robust metadata layer; it’s less reliable and scalable than a dedicated metastore.
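
On current EMR release labels the same effect is achieved with a hive-site configuration classification rather than a bootstrap action; a hedged boto3 sketch follows, with the RDS endpoint, credentials, release label, and instance sizing as placeholders.

```python
import boto3

# hive-site properties pointing every analyst cluster at one shared metastore
# database in Amazon RDS (endpoint, schema, and credentials are placeholders).
shared_metastore = [
    {
        "Classification": "hive-site",
        "Properties": {
            "javax.jdo.option.ConnectionURL": "jdbc:mysql://metastore.example.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "hive_admin",
            "javax.jdo.option.ConnectionPassword": "CHANGE_ME",
        },
    }
]

emr = boto3.client("emr", region_name="us-east-1")  # assumed region
emr.run_job_flow(
    Name="analyst-team-cluster",                    # hypothetical cluster name
    ReleaseLabel="emr-5.36.0",                      # assumed release label
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}, {"Name": "Presto"}],
    Configurations=shared_metastore,
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```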

25
Q

A data engineer is running a data warehouse (DWH) on a 25-node Redshift cluster for a SaaS service. Five large customers represent 80% of usage, while dozens of smaller customers make up the remaining 20%. A dashboarding tool has been selected. How should the data engineer ensure that the larger customer workloads do NOT interfere with the smaller customer workloads?

A. Apply query filters based on customer-id that cannot be changed by the user and apply distribution keys on customer-id.
B. Place the largest customers into a single user group with a dedicated query queue and place the rest of the customers into a different query queue.
C. Push aggregations into an RDS for Aurora instance. Connect the dashboard application to Aurora rather than Redshift for faster queries.
D. Route the largest customers to a dedicated Redshift cluster. Raise the concurrency of the multi-tenant Redshift cluster to accommodate the remaining customers.

A

B

The best solution is to use Redshift's query queueing system to separate large and small customers. Option B suggests creating separate query queues for large and small customers, ensuring that resource contention is minimized. Large customers get dedicated resources, preventing them from impacting smaller customers.

Option A is incorrect because while query filters and distribution keys improve query performance, they don't guarantee resource separation between large and small customers. Large queries could still consume disproportionate resources. Option C is incorrect because using RDS for Aurora is inefficient for this scenario. Aggregating data in Aurora and then fetching it adds latency and doesn't solve the core problem of resource contention in the Redshift cluster. Option D is a more complex and costly solution. While dedicating a separate Redshift cluster for large customers would resolve the issue, it's not the most efficient approach, especially if the 80/20 split of resource usage isn't consistently maintained. Raising the concurrency of the main cluster for the remaining 20% of customers isn't necessary if the existing resources are already sufficient to handle them.
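
A sketch of the corresponding workload management (WLM) change, applied through a cluster parameter group, is shown below; the parameter group name, user group name, and concurrency numbers are illustrative assumptions.

```python
import json

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # assumed region

# Two WLM queues: one bound to a "large_customers" user group and a default
# queue for everyone else, so heavy tenants cannot starve the small ones.
wlm_config = [
    {"user_group": ["large_customers"], "query_concurrency": 5},
    {"query_concurrency": 5},   # default queue for the remaining customers
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="saas-dwh-params",   # hypothetical parameter group
    Parameters=[
        {
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }
    ],
)
```
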
26
Q

An Amazon Kinesis stream needs to be encrypted. Which approach should be used to accomplish this task?

A. Perform a client-side encryption of the data before it enters the Amazon Kinesis stream on the producer.
B. Use a partition key to segment the data by MD5 hash function, which makes it undecipherable while in transit.
C. Perform a client-side encryption of the data before it enters the Amazon Kinesis stream on the consumer.
D. Use a shard to segment the data, which has built-in functionality to make it indecipherable while in transit.

A

A

The correct answer is A because client-side encryption on the producer side is the standard and recommended way to encrypt data before it enters an Amazon Kinesis stream. Options B and D are incorrect because partitioning and sharding are data organization mechanisms, not encryption methods; they do not provide encryption. Option C is incorrect because encrypting on the consumer side does not protect the data while it is in transit within the Kinesis stream; encryption must happen before the data enters the stream.
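
A minimal producer-side sketch using AWS KMS for the encryption step is shown below; note that kms.encrypt is limited to 4 KB of plaintext, so larger records would need envelope encryption. The key alias and stream name are assumptions.

```python
import boto3

kms = boto3.client("kms")
kinesis = boto3.client("kinesis")

KEY_ID = "alias/stream-data"   # hypothetical KMS key alias
STREAM = "telemetry"           # hypothetical stream name


def put_encrypted(plaintext: bytes, partition_key: str) -> None:
    """Encrypt on the producer before the record ever enters the stream."""
    ciphertext = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext)["CiphertextBlob"]
    kinesis.put_record(StreamName=STREAM, Data=ciphertext, PartitionKey=partition_key)
```
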
27
Q

An online retailer is using Amazon DynamoDB to store data related to customer transactions. The items in the table contain several string attributes describing the transaction as well as a JSON attribute containing the shopping cart and other details corresponding to the transaction. Average item size is 250KB, most of which is associated with the JSON attribute. The average customer generates 3GB of data per month. Customers access the table to display their transaction history and review transaction details as needed. Ninety percent of the queries against the table are executed when building the transaction history view, with the other 10% retrieving transaction details. The table is partitioned on CustomerID and sorted on transaction date. The client has very high read capacity provisioned for the table and experiences very even utilization, but complains about the cost of Amazon DynamoDB compared to other NoSQL solutions. Which strategy will reduce the cost associated with the client's read queries while not degrading quality?

A. Modify all database calls to use eventually consistent reads and advise customers that transaction history may be one second out-of-date.
B. Change the primary table to partition on TransactionID, create a GSI partitioned on customer and sorted on date, project small attributes into GSI, and then query GSI for summary data and the primary table for JSON details.
C. Vertically partition the table, store base attributes on the primary table, and create a foreign key reference to a secondary table containing the JSON data. Query the primary table for summary data and the secondary table for JSON details.
D. Create an LSI sorted on date, project the JSON attribute into the index, and then query the primary table for summary data and the LSI for JSON details.

A

A

The discussion strongly suggests A is the correct answer. Options B, C, and D are all deemed incorrect for the following reasons:

  • B. DynamoDB doesn't allow changing a primary key after table creation. Even if it did, the suggested approach wouldn't necessarily reduce RCU and WCU significantly, as the load would be distributed between the base table and GSI.
  • C. DynamoDB does not support foreign keys.
  • D. LSIs cannot be created after table creation, and the table already uses date as a sort key. Furthermore, there are size limitations on LSIs.

Option A, using eventually consistent reads, is the most viable solution because it directly reduces the read capacity units (RCU) required, thus lowering the cost. The one-second delay in data consistency is acceptable for the transaction history use case, as described in the problem.
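
A small sketch of the change option A implies, using the boto3 resource API; the table and attribute names are assumptions.

```python
import boto3
from boto3.dynamodb.conditions import Key

transactions = boto3.resource("dynamodb").Table("customer-transactions")  # hypothetical

# Transaction-history view: an eventually consistent Query (the default)
# consumes half the read capacity of a strongly consistent one.
history = transactions.query(
    KeyConditionExpression=Key("CustomerID").eq("C-1001"),
    ConsistentRead=False,      # explicit for clarity; False is already the default
    ScanIndexForward=False,    # newest transactions first
    Limit=25,
)["Items"]
```
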
28
Q

Managers in a company need access to the human resources database that runs on Amazon Redshift, to run reports about their employees. Managers must only see information about their direct reports. Which technique should be used to address this requirement with Amazon Redshift?

A. Define an IAM group for each manager with each employee as an IAM user in that group, and use that to limit the access.
B. Use Amazon Redshift snapshot to create one cluster per manager. Allow the manager to access only their designated clusters.
C. Define a key for each manager in AWS KMS and encrypt the data for their employees with their private keys.
D. Define a view that uses the employee's manager name to filter the records based on current user names.

A

D

29
Q

An organization uses Amazon Elastic MapReduce (EMR) to process a series of extract-transform-load (ETL) steps that run in sequence. The output of each step must be fully processed in subsequent steps but will not be retained. Which of the following techniques will meet this requirement most efficiently?

A. Use the EMR File System (EMRFS) to store the outputs from each step as objects in Amazon Simple Storage Service (S3).
B. Use the s3n URI to store the data to be processed as objects in Amazon S3.
C. Define the ETL steps as separate AWS Data Pipeline activities.
D. Load the data to be processed into HDFS, and then write the final output to Amazon S3.

A

D

The most efficient technique is D. Because the intermediate outputs are not retained, keeping them on the cluster's HDFS avoids the cost and latency of writing every step's output to Amazon S3; only the final result needs durable storage in S3. Options A and B persist intermediate data in S3 unnecessarily, adding network transfer and storage overhead for data that will be discarded anyway. Option C addresses orchestration, not storage efficiency: defining the steps as AWS Data Pipeline activities does not change how the intermediate data is stored or moved.

30
Q

A telecommunications company needs to predict customer churn (i.e., customers who decide to switch to a competitor). The company has historic records of each customer, including monthly consumption patterns, calls to customer service, and whether the customer ultimately quit the service. All of this data is stored in Amazon S3. The company needs to know which customers are likely going to churn soon so that they can win back their loyalty. What is the optimal approach to meet these requirements?

A. Use the Amazon Machine Learning service to build a binary classification model based on the dataset stored in Amazon S3. The model will be used regularly to predict the churn attribute for existing customers.
B. Use AWS QuickSight to connect it to data stored in Amazon S3 to obtain the necessary business insight. Plot the churn trend graph to extrapolate churn likelihood for existing customers.
C. Use EMR to run Hive queries to build a profile of a churning customer. Apply this profile to existing customers to determine the likelihood of churn.
D. Use a Redshift cluster to COPY the data from Amazon S3. Create a User Defined Function in Redshift that computes the likelihood of churn.

A

A

The optimal approach is to use Amazon Machine Learning (A) to build a binary classification model. This is because predicting customer churn is inherently a prediction problem best suited to machine learning algorithms. A binary classification model can effectively learn patterns from the historical data to predict whether a customer will churn or not.

Option B (AWS QuickSight) is unsuitable because while it can visualize data, it cannot perform predictive modeling. Option C (EMR and Hive) is less optimal because while it can analyze data, it requires manual feature engineering and model building, which is less efficient than a purpose-built machine learning service. Option D (Redshift) is also less optimal as it focuses on data warehousing and processing rather than predictive modeling. While a UDF could be created, it would require significant manual development compared to using a pre-built machine learning service.

31
Q

A solutions architect works for a company that has a data lake based on a central Amazon S3 bucket. The data contains sensitive information. The architect must be able to specify exactly which files each user can access. Users access the platform through a SAML federation Single Sign On platform. The architect needs to build a solution that allows fine-grained access control, traceability of access to the objects, and usage of the standard tools (AWS Console, AWS CLI) to access the data. Which solution should the architect build?

A. Use Amazon S3 Server-Side Encryption with AWS KMS-Managed Keys for storing data. Use AWS KMS Grants to allow access to specific elements of the platform. Use AWS CloudTrail for auditing.
B. Use Amazon S3 Server-Side Encryption with Amazon S3-Managed Keys. Set Amazon S3 ACLs to allow access to specific elements of the platform. Use Amazon S3 access logs for auditing.
C. Use Amazon S3 Client-Side Encryption with Client-Side Master Key. Set Amazon S3 ACLs to allow access to specific elements of the platform. Use Amazon S3 access logs for auditing.
D. Use Amazon S3 Client-Side Encryption with AWS KMS-Managed Keys for storing data. Use AWS KMS Grants to allow access to specific elements of the platform. Use AWS CloudTrail for auditing.

A

B

The correct answer is B because it uses server-side encryption with S3-managed keys, providing encryption and key management handled by AWS. S3 Access Control Lists (ACLs) allow for fine-grained access control to specify which users can access specific files. Using S3 access logs provides traceability of access.

Options A and D incorrectly rely on AWS KMS grants for access control, which is less granular and more complex than using S3 ACLs for this specific scenario. Option C is incorrect because client-side encryption requires the client to manage the encryption keys, which negates the ease of use with standard AWS tools and introduces operational complexity. The question requires a solution using standard tools, and client-side encryption would not satisfy this requirement.

32
Q

A city has been collecting data on its public bicycle share program for the past three years. The 5PB dataset currently resides on Amazon S3. The data contains the following datapoints:
• Bicycle origination points
• Bicycle destination points
• Mileage between the points
• Number of bicycle slots available at the station (which is variable based on the station location)
• Number of slots available and taken at a given time
The program has received additional funds to increase the number of bicycle stations available. All data is regularly archived to Amazon Glacier. The new bicycle stations must be located to provide the most riders access to bicycles. How should this task be performed?

A. Move the data from Amazon S3 into Amazon EBS-backed volumes and use an EC2-based Hadoop cluster with spot instances to run a Spark job that performs a stochastic gradient descent optimization.
B. Use the Amazon Redshift COPY command to move the data from Amazon S3 into Redshift and perform a SQL query that outputs the most popular bicycle stations.
C. Persist the data on Amazon S3 and use a transient EMR cluster with spot instances to run a Spark streaming job that will move the data into Amazon Kinesis.
D. Keep the data on Amazon S3 and use an Amazon EMR-based Hadoop cluster with spot instances to run a Spark job that performs a stochastic gradient descent optimization over EMRFS.

A

B

The best approach is to use Amazon Redshift. The question asks where to place new bicycle stations to maximize rider access. While option D suggests machine learning (stochastic gradient descent), which could theoretically be applied, it is overkill: the existing data already captures station popularity ("Number of slots available and taken at a given time"). A simple SQL query in Redshift can identify the most popular stations (highest number of slots taken), which are strong candidates for expansion. Options A and C involve unnecessary data movement and complexity. Option D, while leveraging EMR and potentially offering more scalability, introduces unnecessary complexity for this problem. Option B provides a direct, efficient, and cost-effective solution.

33
Q

A solutions architect for a logistics organization ships packages from thousands of suppliers to end customers. The architect is building a platform where suppliers can view the status of one or more of their shipments. Each supplier can have multiple roles that will only allow access to specific fields in the resulting information. Which strategy allows the appropriate level of access control and requires the LEAST amount of management work?

A. Send the tracking data to Amazon Kinesis Streams. Use AWS Lambda to store the data in an Amazon DynamoDB Table. Generate temporary AWS credentials for the suppliers users with AWS STS, specifying fine-grained security policies to limit access only to their applicable data.
B. Send the tracking data to Amazon Kinesis Firehose. Use Amazon S3 notifications and AWS Lambda to prepare files in Amazon S3 with appropriate data for each suppliers roles. Generate temporary AWS credentials for the suppliers users with AWS STS. Limit access to the appropriate files through security policies.
C. Send the tracking data to Amazon Kinesis Streams. Use Amazon EMR with Spark Streaming to store the data in HBase. Create one table per supplier. Use HBase Kerberos integration with the suppliers users. Use HBase ACL-based security to limit access for the roles to their specific table and columns.
D. Send the tracking data to Amazon Kinesis Firehose. Store the data in an Amazon Redshift cluster. Create views for the suppliers users and roles. Allow suppliers access to the Amazon Redshift cluster using a user limited to the applicable view.

A

A

The discussion overwhelmingly supports option A as the best solution. The key reasons are:

  • Fine-grained access control: DynamoDB, combined with AWS STS and IAM policies, offers fine-grained access control (FGAC), allowing precise specification of which data each supplier role can access. This directly addresses the requirement for limiting access to specific fields.
  • Least management overhead: Creating and managing thousands of views (option D), tables (option C), or files (option B) would be significantly more complex than managing IAM policies for DynamoDB. DynamoDB's FGAC combined with STS reduces the management burden substantially.

Option B is less efficient due to the overhead of creating and managing numerous files in S3. Option C is overly complex and requires significant infrastructure setup and maintenance with HBase, Kerberos, and ACL management. Option D is also inefficient because managing thousands of views in Redshift would be time-consuming and complex.

34
Q

An organization uses a custom MapReduce application on an Amazon EMR cluster to build monthly reports from many small data files in an Amazon S3 bucket. Data submission from various business units is frequent but unpredictable. Scaling the EMR cluster has not solved performance issues as the dataset grows. The organization needs to improve performance with minimal changes to existing processes and applications. What action should the organization take?

A. Use Amazon S3 Event Notifications and AWS Lambda to create a quick search file index in DynamoDB.
B. Add Spark to the Amazon EMR cluster and utilize Resilient Distributed Datasets in-memory.
C. Use Amazon S3 Event Notifications and AWS Lambda to index each file into an Amazon Elasticsearch Service cluster.
D. Schedule a daily AWS Data Pipeline process that aggregates content into larger files using S3DistCp.
E. Have business units submit data via Amazon Kinesis Firehose to aggregate data hourly into Amazon S3.

A

D

The correct answer is D because S3DistCp is specifically designed for efficiently merging many small files into larger files, directly addressing the performance bottleneck caused by processing numerous small files. This requires minimal changes to the existing process, as it only involves scheduling a daily aggregation task.

Option A is incorrect because creating a search index in DynamoDB doesn't directly improve the performance of the MapReduce job itself. Option B is incorrect because while Spark and Resilient Distributed Datasets can improve performance, it requires significant code changes to the existing MapReduce application. Option C is incorrect as it involves indexing into Elasticsearch, which is irrelevant to the core MapReduce processing. Option E would change the data ingestion pipeline significantly, requiring modification to the business units' submission process.
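
A hedged sketch of scheduling the aggregation as an EMR step that runs s3-dist-cp follows; the cluster ID, bucket paths, group-by pattern, and target size are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Concatenate many small files into larger objects before the monthly
# MapReduce job reads them.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",   # placeholder cluster ID
    Steps=[{
        "Name": "aggregate-small-files",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://example-raw-bucket/incoming/",
                "--dest", "s3://example-curated-bucket/daily/",
                "--groupBy", ".*(business-unit-\\d+).*",
                "--targetSize", "1024",
            ],
        },
    }],
)
```
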
35
Q

An administrator is processing events in near real-time using Kinesis streams and Lambda. Lambda intermittently fails to process batches from one of the shards due to a 5-minute time limit. What is a possible solution for this problem?

A. Add more Lambda functions to improve concurrent batch processing.
B. Reduce the batch size that Lambda is reading from the stream.
C. Ignore and skip events that are older than 5 minutes and put them to Dead Letter Queue (DLQ).
D. Configure Lambda to read from fewer shards in parallel.

A

B

The correct answer is B because the problem stems from Lambda exceeding its 5-minute time limit while processing batches. Reducing the batch size will decrease the processing time for each invocation, making it less likely to hit the time limit.

Option A is incorrect because adding more Lambda functions doesn't address the root cause: the large batch size causing timeouts. Option C is incorrect because ignoring events and sending them to the DLQ doesn't solve the processing issue; it only hides the symptom. Option D is incorrect because each Kinesis shard already invokes a separate Lambda function. Reducing the number of shards a single Lambda handles is not possible in this architecture. The solution lies in adjusting the batch size within the Lambda function configuration.
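
The batch size is a property of the function's Kinesis event source mapping, so the fix is a one-line configuration change; the mapping UUID and the new value are placeholders.

```python
import boto3

lambda_client = boto3.client("lambda")

# A smaller batch shortens each invocation so it finishes within the timeout.
lambda_client.update_event_source_mapping(
    UUID="14e0db71-0000-0000-0000-000000000000",   # placeholder mapping UUID
    BatchSize=100,                                 # down from e.g. 500 records
)
```
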
36
A data engineer wants to use an Amazon Elastic MapReduce (EMR) application and ensure it complies with regulatory requirements. The auditor must be able to confirm at any point which servers are running and which network access controls are deployed. Which action should the data engineer take to meet this requirement? A. Provide the auditor IAM accounts with the SecurityAudit policy attached to their group. B. Provide the auditor with SSH keys for access to the Amazon EMR cluster. C. Provide the auditor with CloudFormation templates. D. Provide the auditor with access to AWS Direct Connect to use their existing tools.
A
37
A medical record filing system for a government medical fund is using an Amazon S3 bucket to archive documents related to patients. Every patient visit to a physician creates a new file, which can add up to millions of files each month. Collection of these files from each physician is handled via a batch process that runs every night using AWS Data Pipeline. This is sensitive data, so the data and any associated metadata must be encrypted at rest. Auditors review some files on a quarterly basis to see whether the records are maintained according to regulations. Auditors must be able to locate any physical file in the S3 bucket for a given date, patient, or physician. Auditors spend a significant amount of time locating such files. What is the most cost- and time-efficient collection methodology in this situation? A. Use Amazon Kinesis to get the data feeds directly from physicians, batch them using a Spark application on Amazon Elastic MapReduce (EMR), and then store them in Amazon S3 with folders separated per physician. B. Use Amazon API Gateway to get the data feeds directly from physicians, batch them using a Spark application on Amazon Elastic MapReduce (EMR), and then store them in Amazon S3 with folders separated per physician. C. Use Amazon S3 event notification to populate an Amazon DynamoDB table with metadata about every file loaded to Amazon S3, and partition them based on the month and year of the file. D. Use Amazon S3 event notification to populate an Amazon Redshift table with metadata about every file loaded to Amazon S3, and partition them based on the month and year of the file.
C The correct answer is C because it provides the most cost-effective and time-efficient solution for the auditors' needs. DynamoDB is a NoSQL database optimized for fast key-value lookups, making it ideal for quickly searching metadata based on patient, date, or physician. Partitioning by month and year further improves query performance. Options A and B are incorrect because EMR, while suitable for large-scale batch processing, isn't optimized for the quick, ad-hoc searches required by auditors. Searching through millions of files in S3 based on patient or date would be extremely time-consuming and inefficient. Option D is incorrect because Redshift is a data warehouse optimized for analytical queries on large datasets, not for the frequent, low-latency lookups required by auditors. Using Redshift would be overkill and less cost-effective than DynamoDB for this use case.
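A minimal sketch of the indexing piece of option C, assuming the S3 object keys encode physician, patient, and visit date (a hypothetical key layout) and a DynamoDB table partitioned by year-month:

```python
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("PatientFileIndex")  # hypothetical table name

def handler(event, context):
    """Triggered by S3 ObjectCreated events; indexes each file's metadata."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Assumes keys look like physician-id/patient-id/2023-05-17/visit.pdf
        physician_id, patient_id, visit_date, _ = key.split("/", 3)
        table.put_item(Item={
            "YearMonth": visit_date[:7],        # partition key, e.g. "2023-05"
            "FileKey": f"s3://{bucket}/{key}",  # sort key pointing at the object
            "PatientId": patient_id,
            "PhysicianId": physician_id,
            "VisitDate": visit_date,
        })
```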
38
An organization needs to design and deploy a large-scale data storage solution that will be highly durable and highly flexible with respect to the type and structure of data being stored. The data to be stored will be sent or generated from a variety of sources and must be persistently available for access and processing by multiple applications. What is the most cost-effective technique to meet these requirements? A. Use Amazon Simple Storage Service (S3) as the actual data storage system, coupled with appropriate tools for ingestion/acquisition of data and for subsequent processing and querying. B. Deploy a long-running Amazon Elastic MapReduce (EMR) cluster with Amazon Elastic Block Store (EBS) volumes for persistent HDFS storage and appropriate Hadoop ecosystem tools for processing and querying. C. Use Amazon Redshift with data replication to Amazon Simple Storage Service (S3) for comprehensive durable data storage, processing, and querying. D. Launch an Amazon Relational Database Service (RDS), and use the enterprise grade and capacity of the Amazon Aurora engine for storage, processing, and querying.
A The discussion overwhelmingly supports option A as the most cost-effective solution. While the suggested answer is C, the community feedback strongly indicates that using S3 (option A) with appropriate tools for data ingestion, processing, and querying is a more cost-effective approach than the alternatives. Option B (EMR with EBS) is likely more expensive due to the continuous cluster operation and management overhead. Option C (Redshift with S3 replication) introduces the cost and complexity of Redshift, while option D (RDS with Aurora) is designed for relational data, not the diverse and unstructured data mentioned in the problem statement. S3's pay-as-you-go model aligns best with the need for a flexible and cost-effective solution for large-scale storage of various data types.
39
A company needs a churn prevention model to predict which customers will NOT renew their yearly subscription to the company's service. A binary classification model using Amazon Machine Learning is required. On which basis should this binary classification model be built? A. User profiles (age, gender, income, occupation) B. Last user session C. Each user's time series events in the past 3 months D. Quarterly results
C
40
An organization is currently using an Amazon EMR long-running cluster with the latest Amazon EMR release for analytic jobs and is storing data as external tables on Amazon S3. The company needs to launch multiple transient EMR clusters to access the same tables concurrently, but the metadata about the Amazon S3 external tables are defined and stored on the long-running cluster. Which solution will expose the Hive metastore with the LEAST operational effort? A. Export Hive metastore information to Amazon DynamoDB, configuring the hive-site classification to point to the Amazon DynamoDB table. B. Export Hive metastore information to a MySQL table on Amazon RDS and configure the Amazon EMR hive-site classification to point to the Amazon RDS database. C. Launch an Amazon EC2 instance, install and configure Apache Derby, and export the Hive metastore information to Derby. D. Create and configure an AWS Glue Data Catalog as a Hive metastore for Amazon EMR.
D The correct answer is D because using AWS Glue Data Catalog as a Hive metastore offers the least operational effort for sharing metadata across multiple transient EMR clusters. The discussion highlights that using RDS for the metastore is not recommended for concurrent writes, and setting up a self-managed solution like Derby (option C) requires significant operational overhead. AWS Glue is designed for this purpose, offering scalability and ease of management compared to other options. Option A, using DynamoDB, is also less efficient than Glue for this use case.
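A sketch of launching a transient cluster whose Hive uses the Glue Data Catalog, via the documented hive-site classification; instance types, counts, and role names are placeholders.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="transient-analytics-cluster",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Hive"}],
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            # Documented setting that points Hive on EMR at the Glue Data Catalog
            # instead of a cluster-local metastore.
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```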
41
An Operations team continuously monitors the number of visitors to a website to identify any potential system problems. The number of website visitors varies throughout the day. The site is more popular in the middle of the day and less popular at night. Which type of dashboard display would be the MOST useful to allow staff to quickly and correctly identify system problems? A. A vertical stacked bar chart showing today's website visitors and the historical average number of website visitors. B. An overlay line chart showing today's website visitors at one-minute intervals and also the historical average number of website visitors. C. A single KPI metric showing the statistical variance between the current number of website visitors and the historical number of website visitors for the current time of day. D. A scatter plot showing today's website visitors on the X-axis and the historical average number of website visitors on the Y-axis.
B The best option is B because it provides a visual representation of the website visitor trend over time, allowing for quick identification of deviations from the historical average. This is crucial for spotting potential system problems. Option A (stacked bar chart) is less effective because it lacks the granular detail of a line chart and doesn't easily show the trends over time. Option C (KPI metric) only shows a single number, providing no visual context or trend information and making it hard to judge whether a deviation is significant or simply normal fluctuation. Option D (scatter plot) is also less suitable as it does not directly show the time-series trend needed to identify problems. While a sudden spike or drop might be visible, the temporal aspect is not clearly represented.
42
An organization currently runs a large Hadoop environment in their data center and is in the process of creating an alternative Hadoop environment on AWS, using Amazon EMR. They generate around 20 TB of data on a monthly basis. Also on a monthly basis, files need to be grouped and copied to Amazon S3 to be used for the Amazon EMR environment. They have multiple S3 buckets across AWS accounts to which data needs to be copied. There is a 10G AWS Direct Connect setup between their data center and AWS, and the network team has agreed to allocate 50% of AWS Direct Connect bandwidth to data transfer. The data transfer cannot take more than two days. What would be the MOST efficient approach to transfer data to AWS on a monthly basis? A. Use an offline copy method, such as an AWS Snowball device, to copy and transfer data to Amazon S3. B. Configure a multipart upload for Amazon S3 on AWS Java SDK to transfer data over AWS Direct Connect. C. Use Amazon S3 transfer acceleration capability to transfer data over AWS Direct Connect. D. Setup S3DistCp tool on the on-premises Hadoop environment to transfer data to Amazon S3 over AWS Direct Connect.
D The most efficient approach is to use the S3DistCp tool. S3DistCp is designed for parallel copying of large datasets from Hadoop Distributed File System (HDFS) to Amazon S3, making it ideal for this scenario. It leverages the Hadoop framework's ability to handle large files and distribute the transfer work across multiple nodes, which significantly accelerates the process. This directly addresses the requirement of transferring 20TB of data within a 2-day timeframe. Option A (Snowball) is too slow for this volume of data and the available bandwidth. Option B (Java SDK multipart upload) is less efficient than a Hadoop-optimized tool like S3DistCp for moving large datasets. Option C (S3 Transfer Acceleration) is beneficial for high-latency transfers, but given the Direct Connect link and the need for direct Hadoop integration, it's not as efficient as S3DistCp.
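A quick back-of-the-envelope check that the allocated bandwidth is sufficient, assuming ideal sustained throughput on 50% of the 10 Gbps link:

```python
data_tb = 20               # monthly data volume in terabytes
bandwidth_gbps = 10 * 0.5  # 50% of the 10 Gbps Direct Connect link

data_bits = data_tb * 8 * 10**12            # TB -> bits (decimal units)
seconds = data_bits / (bandwidth_gbps * 10**9)
print(f"{seconds / 3600:.1f} hours")        # ~8.9 hours, well under the 2-day limit
```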
43
An organization is developing a mobile social application and needs to collect logs from all devices on which it is installed. The organization is evaluating Amazon Kinesis Data Streams to push logs and Amazon EMR to process data. They want to store data on HDFS using the default replication factor to replicate data among the cluster, but they are concerned about the durability of the data. Currently, they are producing 300 GB of raw data daily, with additional spikes during special events. They will need to scale out the Amazon EMR cluster to match the increase in streamed data. Which solution prevents data loss and matches compute demand? A. Use multiple Amazon EBS volumes on Amazon EMR to store processed data and scale out the Amazon EMR cluster as needed. B. Use the EMR File System and Amazon S3 to store processed data and scale out the Amazon EMR cluster as needed. C. Use Amazon DynamoDB to store processed data and scale out the Amazon EMR cluster as needed. D. Use Amazon Kinesis Data Firehose and, instead of using Amazon EMR, stream logs directly into Amazon Elasticsearch Service.
B The best solution is B because it addresses both data durability and scalability concerns. Amazon S3 offers high durability and redundancy, preventing data loss. Using the EMR File System allows the EMR cluster to access the data stored in S3, providing the required compute resources for processing. The ability to scale out the EMR cluster accommodates the variable data volume, including spikes. Option A is incorrect because while EBS volumes offer persistence, they lack the scalability of S3 for handling large datasets and spikes in data volume. Option C is incorrect because DynamoDB is a NoSQL database optimized for low-latency access, not for large-scale batch processing and storage of large log files. Option D is incorrect because it removes the EMR cluster, which is a requirement in the problem statement. While Kinesis Firehose and Elasticsearch are good for log management, they don't fulfill the requirement of using EMR for data processing.
44
An advertising organization uses an application to process a stream of events received from clients in multiple unstructured formats. The application transforms the events into a single structured format and streams them to Amazon Kinesis for real-time analysis. It also stores the unstructured raw events from the log files on local hard drives that are rotated and uploaded to Amazon S3. The organization wants to extract campaign performance reporting using an existing Amazon Redshift cluster. Which solution will provide the performance data with the LEAST number of operations? A. Install the Amazon Kinesis Data Firehose agent on the application servers and use it to stream the log files directly to Amazon Redshift. B. Create an external table in Amazon Redshift and point it to the S3 bucket where the unstructured raw events are stored. C. Write an AWS Lambda function that triggers every hour to load the new log files already in S3 to Amazon Redshift. D. Connect Amazon Kinesis Data Firehose to the existing Amazon Kinesis stream and use it to stream the events directly to Amazon Redshift.
D The best solution is D because it leverages the already structured data stream in Amazon Kinesis. Option A is incorrect because it attempts to load unstructured log files directly into Redshift, which is inefficient and likely to fail. Option B is incorrect because querying unstructured data in S3 using Redshift Spectrum is inefficient. Option C is incorrect because it adds an extra step (Lambda processing) when a more direct method exists. Option D provides the least number of operations by directly streaming the already structured data from Kinesis to Redshift.
45
An organization needs to store sensitive information on Amazon S3 and process it through Amazon EMR. Data must be encrypted on Amazon S3 and Amazon EMR at rest and in transit. Using Thrift Server, the Data Analysis team uses Hive to interact with this data. The organization would like to grant access to only specific databases and tables, giving permission only to the SELECT statement. Which solution will protect the data and limit user access to the SELECT statement on a specific portion of data? A. Configure Transparent Data Encryption on Amazon EMR. Create an Amazon EC2 instance and install Apache Ranger. Configure the authorization on the cluster to use Apache Ranger. B. Configure data encryption at rest for EMR File System (EMRFS) on Amazon S3. Configure data encryption in transit for traffic between Amazon S3 and EMRFS. Configure storage and SQL base authorization on HiveServer2. C. Use AWS KMS for encryption of data. Configure and attach multiple roles with different permissions based on the different user needs. D. Configure Security Group on Amazon EMR. Create an Amazon VPC endpoint for Amazon S3. Configure HiveServer2 to use Kerberos authentication on the cluster.
A The discussion reveals that option A is the correct answer. Option B is incorrect because, while it addresses encryption, EMR doesn't support Hive authorization for EMRFS and S3. Option C, while utilizing KMS for encryption, doesn't specifically address the granular access control needed at the database and table level for SELECT statements only. Option D focuses on network security and authentication but doesn't provide the necessary fine-grained authorization controls. Option A correctly leverages Apache Ranger, which provides the capability to implement fine-grained access control at the database and table levels, restricting access to only the SELECT statement as required.
46
An organization's data warehouse contains sales data for reporting purposes. Data governance policies prohibit staff from accessing customers' credit card numbers. How can these policies be adhered to and still allow a Data Scientist to group transactions that use the same credit card number? A. Store a cryptographic hash of the credit card number. B. Encrypt the credit card number with a symmetric encryption key, and give the key only to the authorized Data Scientist. C. Mask the credit card numbers to only show the last four digits of the credit card number. D. Encrypt the credit card number with an asymmetric encryption key and give the decryption key only to the authorized Data Scientist.
A The correct answer is A because a cryptographic hash function will produce the same output for the same input. This allows the Data Scientist to group transactions based on the identical hash value, representing the same credit card number, without ever having access to the actual credit card number itself. The process is one-way, meaning the original credit card number cannot be recovered from the hash. Option B is incorrect because it violates data governance policies by giving the Data Scientist access to the decryption key, which would allow them to access the actual credit card numbers. Option C is incorrect because the last four digits are not unique: different cards can share the same last four digits, so grouping transactions by the masked value would be unreliable. Option D is also incorrect for the same reason as option B; it still requires giving the Data Scientist access to sensitive information (the decryption key).
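A minimal sketch of the tokenization idea, using a keyed hash (HMAC-SHA-256) rather than a bare hash because the card-number keyspace is small enough to brute-force; the secret key shown is a placeholder that would be held outside the warehouse.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-held-outside-the-warehouse"  # hypothetical

def card_token(card_number: str) -> str:
    """Deterministic, one-way token: same card -> same token, original not recoverable."""
    return hmac.new(SECRET_KEY, card_number.encode(), hashlib.sha256).hexdigest()

# Identical card numbers produce identical tokens, so transactions can be grouped.
assert card_token("4111111111111111") == card_token("4111111111111111")
```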
47
A real-time bidding company is rebuilding their monolithic application and is focusing on serving real-time data. A large number of reads and writes are generated from thousands of concurrent users who follow items and bid on the company's sale offers. The company is experiencing high latency during special event spikes, with millions of concurrent users. The company needs to analyze and aggregate a part of the data in near real time to feed an internal dashboard. What is the BEST approach for serving and analyzing data, considering the constraint of the low latency on the highly demanded data? A. Use Amazon Aurora with Multi Availability Zone and read replicas. Use Amazon ElastiCache in front of the read replicas to serve read-only content quickly. Use the same database as the datasource for the dashboard. B. Use Amazon DynamoDB to store real-time data with Amazon DynamoDB Accelerator to serve content quickly. Use Amazon DynamoDB Streams to replay all changes to the table, process and stream to Amazon Elasticsearch Service with AWS Lambda. C. Use Amazon RDS with Multi Availability Zone. Provisioned IOPS EBS volume for storage. Enable up to five read replicas to serve read-only content quickly. Use Amazon EMR with Sqoop to import Amazon RDS data into HDFS for analysis. D. Use Amazon Redshift with a DC2 node type and a multi-node cluster. Create an Amazon EC2 instance with pgpool installed. Create an Amazon ElastiCache cluster and route read requests through pgpool, and use Amazon Redshift for analysis.
B The best approach is B because it leverages services designed for high throughput and low latency, crucial for real-time bidding. DynamoDB is a NoSQL database optimized for high-performance reads and writes, and DynamoDB Accelerator further enhances performance. DynamoDB Streams provide a mechanism for capturing data changes in near real-time, which are then processed by Lambda and sent to Elasticsearch for dashboarding. This entire pipeline is well-suited for the requirements of near real-time analysis. Option A uses Aurora, which, while scalable, may not offer the same level of low-latency performance as DynamoDB, especially under extreme load. The use of ElastiCache is helpful, but the underlying database remains a potential bottleneck. Option C uses RDS, which is less performant than DynamoDB for this use case. Using EMR and Sqoop for analysis introduces significant latency, making it unsuitable for near real-time dashboarding. Option D uses Redshift, a data warehouse designed for analytical queries, not real-time data serving. The use of pgpool and ElastiCache adds complexity without addressing the core latency issue stemming from Redshift’s inherent limitations in real-time processing. The latency introduced by using Redshift for analysis is too high for the near real-time dashboarding requirement.
48
A gas company needs to monitor gas pressure in their pipelines. Pressure data is streamed from sensors placed throughout the pipelines to monitor the data in real time. When an anomaly is detected, the system must send a notification to open a valve. An Amazon Kinesis stream collects the data from the sensors and an anomaly Kinesis stream triggers an AWS Lambda function to open the appropriate valve. Which solution is the MOST cost-effective for responding to anomalies in real time? A. Attach a Kinesis Firehose to the stream and persist the sensor data in an Amazon S3 bucket. Schedule an AWS Lambda function to run a query in Amazon Athena against the data in Amazon S3 to identify anomalies. When a change is detected, the Lambda function sends a message to the anomaly stream to open the valve. B. Launch an Amazon EMR cluster that uses Spark Streaming to connect to the Kinesis stream and Spark machine learning to detect anomalies. When a change is detected, the Spark application sends a message to the anomaly stream to open the valve. C. Launch a fleet of Amazon EC2 instances with a Kinesis Client Library application that consumes the stream and aggregates sensor data over time to identify anomalies. When an anomaly is detected, the application sends a message to the anomaly stream to open the valve. D. Create a Kinesis Analytics application by using the RANDOM_CUT_FOREST function to detect an anomaly. When the anomaly score that is returned from the function is outside of an acceptable range, a message is sent to the anomaly stream to open the valve.
D The discussion strongly suggests that option D is the most cost-effective solution. Options A, B, and C involve more complex and resource-intensive architectures. Option A uses a batch processing approach (Athena) which is not suitable for real-time anomaly detection. Option B requires managing an entire EMR cluster, which is significantly more expensive than other options. Option C involves managing a fleet of EC2 instances, increasing operational complexity and cost. Option D leverages Kinesis Analytics with the RANDOM_CUT_FOREST function, a purpose-built solution for real-time anomaly detection within the Kinesis ecosystem, making it the most cost-effective choice.
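A sketch, loosely following the documented RANDOM_CUT_FOREST SQL pattern, of creating such a Kinesis Analytics application with boto3; the application name and column names are placeholders, and the input/output wiring to the sensor stream and the anomaly stream is omitted.

```python
import boto3

# Anomaly-scoring SQL modeled on the documented RANDOM_CUT_FOREST examples.
application_code = """
CREATE OR REPLACE STREAM "ANOMALY_STREAM" ("pressure" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
    INSERT INTO "ANOMALY_STREAM"
    SELECT STREAM "pressure", "ANOMALY_SCORE"
    FROM TABLE(RANDOM_CUT_FOREST(
        CURSOR(SELECT STREAM "pressure" FROM "SOURCE_SQL_STREAM_001")
    ));
"""

kinesisanalytics = boto3.client("kinesisanalytics")
kinesisanalytics.create_application(
    ApplicationName="pipeline-pressure-anomalies",
    ApplicationCode=application_code,
    # Input (the sensor Kinesis stream) and output (the anomaly stream that
    # triggers the valve Lambda) configuration is omitted here for brevity.
)
```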
49
A gaming organization is developing a new game and would like to offer real-time competition to their users. The data architecture has the following characteristics: * The game application is writing events directly to Amazon DynamoDB from the user's mobile device. * Users from the website can access their statistics directly from DynamoDB. * The game servers are accessing DynamoDB to update the user's information. * The data science team extracts data from DynamoDB for various applications. The engineering team has already agreed to the IAM roles and policies to use for the data science team and the application. Which actions will provide the MOST security, while maintaining the necessary access to the website and game application? (Choose two.) A. Use Amazon Cognito user pool to authenticate to both the website and the game application. B. Use IAM identity federation to authenticate to both the website and the game application. C. Create an IAM policy with PUT permission for both the website and the game application. D. Create an IAM policy with fine-grained permission for both the website and the game application. E. Create an IAM policy with PUT permission for the game application and an IAM policy with GET permission for the website.
A, D
50
An organization has 10,000 devices that generate 100 GB of telemetry data per day, with each record size around 10 KB. Each record has 100 fields, and one field consists of unstructured log data with a "String" data type in the English language. Some fields are required for the real-time dashboard, but all fields must be available for long-term trend generation. The organization also has 10 PB of previously cleaned and structured data, partitioned by Date, in a SAN that must be migrated to AWS within one month. Currently, the organization does not have any real-time capabilities in their solution. Because of storage limitations in the on-premises data warehouse, selective data is loaded while generating the long-term trend with ANSI SQL queries through JDBC for visualization. In addition to the one-time data loading, the organization needs a cost-effective and real-time solution. How can these requirements be met? (Choose two.) A. Use AWS IoT to send data from devices to an Amazon SQS queue, create a set of workers in an Auto Scaling group and read records in batch from the queue to process and save the data. Fan out to an Amazon SNS topic attached with an AWS Lambda function to filter the request dataset and save it to Amazon Elasticsearch Service for real-time analytics. B. Create a Direct Connect connection between AWS and the on-premises data center and copy the data to Amazon S3 using S3 Acceleration. Use Amazon Athena to query the data. C. Use AWS IoT to send the data from devices to Amazon Kinesis Data Streams with the IoT rules engine. Use one Kinesis Data Firehose stream attached to a Kinesis stream to batch and stream the data partitioned by date. Use another Kinesis Firehose stream attached to the same Kinesis stream to filter out the required fields to ingest into Elasticsearch for real-time analytics. D. Use AWS IoT to send the data from devices to Amazon Kinesis Data Streams with the IoT rules engine. Use one Kinesis Data Firehose stream attached to a Kinesis stream to stream the data into an Amazon S3 bucket partitioned by date. Attach an AWS Lambda function with the same Kinesis stream to filter out the required fields for ingestion into Amazon DynamoDB for real-time analytics. E. Use multiple AWS Snowball Edge devices to transfer data to Amazon S3, and use Amazon Athena to query the data.
D, E The requirements split into two parts: a one-time migration of 10 PB of historical data and a real-time pipeline for the new telemetry. E is the practical choice for the migration. Ten petabytes cannot realistically be pushed over a Direct Connect link within a month (even at a sustained 10 Gbps, 10 PB takes roughly 90 days to transfer), and S3 Transfer Acceleration routes traffic over the public internet through edge locations, so it adds nothing on top of Direct Connect; option B is therefore not feasible in the one-month window. Multiple Snowball Edge devices move the data offline in parallel, and Amazon Athena then provides cost-effective ANSI SQL querying of the partitioned data once it lands in S3. D is the best choice for the real-time solution. It uses the IoT rules engine to push device data into Kinesis Data Streams, a Kinesis Data Firehose stream to land all fields in Amazon S3 partitioned by date for long-term trend analysis, and a Lambda function on the same Kinesis stream to filter the fields needed for the real-time dashboard into DynamoDB, which offers low-latency reads at this scale. Option A, built on SQS and an Auto Scaling group of workers, is more operationally complex and less suited to this throughput than Kinesis. Option C also uses Kinesis, but running a second Firehose stream into Elasticsearch is likely to be less cost-effective for this data volume than the single stream plus Lambda and DynamoDB in option D.
51
An Amazon Redshift Database is encrypted using KMS. A data engineer needs to use the AWS CLI to create a KMS encrypted snapshot of the database in another AWS region. Which three steps should the data engineer take to accomplish this task? (Choose three.) A. Create a new KMS key in the destination region. B. Copy the existing KMS key to the destination region. C. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key from the source region. D. In the source region, enable cross-region replication and specify the name of the copy grant created. E. In the destination region, enable cross-region replication and specify the name of the copy grant created. F. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key created in the destination region.
A, D, F The correct steps are A, D, and F: * **A. Create a new KMS key in the destination region:** KMS keys are regional, so a key must exist in the destination region to encrypt the copied snapshots there; the source region's key cannot be used directly. * **F. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key created in the destination region:** The snapshot copy grant, created in the destination region, authorizes Redshift to encrypt the copied snapshots with that key. * **D. In the source region, enable cross-region replication and specify the name of the copy grant created:** Cross-region snapshot copy is enabled on the cluster in the source region, specifying the destination region and the grant name, so that snapshots are copied and re-encrypted in the destination region. The other options are incorrect because: * **B. Copy the existing KMS key to the destination region:** KMS keys cannot be copied across regions; a new key must be created in the destination region. * **C. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key from the source region:** The grant must reference a key in the destination region, because that is where the copied snapshots are encrypted. * **E. In the destination region, enable cross-region replication and specify the name of the copy grant created:** Cross-region snapshot copy is configured on the source cluster, not in the destination region.
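A boto3 sketch of those three steps, with placeholder names and regions:

```python
import boto3

# 1. Create a KMS key in the destination region (KMS keys are regional).
kms_dest = boto3.client("kms", region_name="eu-west-1")
key_id = kms_dest.create_key(Description="Redshift snapshot copies")["KeyMetadata"]["KeyId"]

# 2. Create a snapshot copy grant in the destination region so Redshift
#    can encrypt copied snapshots with that key.
redshift_dest = boto3.client("redshift", region_name="eu-west-1")
redshift_dest.create_snapshot_copy_grant(
    SnapshotCopyGrantName="example-copy-grant",
    KmsKeyId=key_id,
)

# 3. Enable cross-region snapshot copy on the cluster in the source region,
#    referencing the destination region and the grant created above.
redshift_src = boto3.client("redshift", region_name="us-east-1")
redshift_src.enable_snapshot_copy(
    ClusterIdentifier="example-cluster",
    DestinationRegion="eu-west-1",
    SnapshotCopyGrantName="example-copy-grant",
)
```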
52
How should an Administrator BEST architect a large multi-layer Long Short-Term Memory (LSTM) recurrent neural network (RNN) running with MXNET on Amazon EC2? (Choose two.) A. Use data parallelism to partition the workload over multiple devices and balance the workload within the GPUs. B. Use compute-optimized EC2 instances with an attached elastic GPU. C. Use general purpose GPU computing instances such as G3 and P3. D. Use processing parallelism to partition the workload over multiple storage devices and balance the workload within the GPUs.
A, C
53
A company uses Amazon Redshift for its enterprise data warehouse. A new on-premises PostgreSQL OLTP database must be integrated into the data warehouse. Each table in the PostgreSQL database has an indexed timestamp column. The data warehouse has a staging layer to load source data into the data warehouse environment for further processing. The data lag between the source PostgreSQL database and the Amazon Redshift staging layer should NOT exceed four hours. What is the most efficient technique to meet these requirements? A. Create a DBLINK on the source DB to connect to Amazon Redshift. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and execute the event on the Amazon Redshift staging table. B. Use a PostgreSQL trigger on the source table to capture the new insert/update/delete event and write it to Amazon Kinesis Streams. Use a KCL application to execute the event on the Amazon Redshift staging table. C. Extract the incremental changes periodically using a SQL query. Upload the changes to multiple Amazon Simple Storage Service (S3) objects, and run the COPY command to load to the Amazon Redshift staging layer. D. Extract the incremental changes periodically using a SQL query. Upload the changes to a single Amazon Simple Storage Service (S3) object, and run the COPY command to load to the Amazon Redshift staging layer.
C The most efficient method is C because it leverages the high-speed COPY command in Redshift for loading data from S3. Option A is incorrect because DBLINKs are generally less efficient than other methods and are not designed to work seamlessly with on-premises databases. Option B, while possible, introduces additional complexity and overhead with Kinesis and KCL, making it less efficient. Option D is incorrect because loading all changes into a single S3 object would hinder the parallel loading capabilities of the COPY command and would be inefficient for larger datasets and multiple tables. Using multiple S3 objects (one per table, or a logical grouping of tables) allows for parallel loading into Redshift, significantly improving efficiency.
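A minimal sketch of the load step, assuming the incremental changes (selected by the indexed timestamp column) have already been uploaded as multiple gzipped CSV objects under one S3 prefix; the connection details, bucket, and IAM role are placeholders.

```python
import psycopg2

# Connection details and the IAM role ARN are placeholders.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="loader", password="...",
)

copy_sql = """
    COPY staging.orders_delta
    FROM 's3://example-extract-bucket/orders/2023-05-17T08/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV GZIP;
"""

with conn, conn.cursor() as cur:
    # COPY loads all objects under the prefix in parallel across slices,
    # which is why splitting the extract into multiple files helps.
    cur.execute(copy_sql)
```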
54
An administrator tries to use the Amazon Machine Learning service to classify social media posts that mention the administrator's company into posts that require a response and posts that do not. The training dataset of 10,000 posts contains the details of each post including the timestamp, author, and full text of the post. The administrator is missing the target labels that are required for training. Which Amazon Machine Learning model is the most appropriate for this task? A. Binary classification model, where the target class is the require-response post B. Binary classification model, where the two classes are the require-response post and does-not-require-response C. Multi-class prediction model, with two classes: require-response post and does-not-require-response D. Regression model where the predicted value is the probability that the post requires a response
D The most appropriate model is a regression model. The discussion highlights that options A and B are incorrect because the administrator lacks the necessary target labels for supervised learning (binary classification requires labeled data). Option C is incorrect as it's a type of classification which also needs labeled data. Option D, a regression model predicting the probability of a response, is the only viable choice when target labels are missing; it can learn patterns from the text data without pre-existing classifications. While not explicitly stated as the *most* appropriate, it is the only option feasible given the constraints.
55
An organization is using Amazon Kinesis Data Streams to collect data generated from thousands of temperature devices and is using AWS Lambda to process the data. Devices generate 10 to 12 million records every day, but Lambda is processing only around 450 thousand records. Amazon CloudWatch indicates that throttling on Lambda is not occurring. What should be done to ensure that all data is processed? (Choose two.) A. Increase the BatchSize value on the EventSource, and increase the memory allocated to the Lambda function. B. Decrease the BatchSize value on the EventSource, and increase the memory allocated to the Lambda function. C. Create multiple Lambda functions that will consume the same Amazon Kinesis stream. D. Increase the number of vCores allocated for the Lambda function. E. Increase the number of shards on the Amazon Kinesis stream.
A, E
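A minimal boto3 sketch of the two adjustments implied by A and E; the stream name, function name, target shard count, and memory size are placeholders.

```python
import boto3

# Scale the stream so total shard throughput covers the 10-12 million
# records per day plus headroom.
kinesis = boto3.client("kinesis")
kinesis.update_shard_count(
    StreamName="temperature-events",
    TargetShardCount=12,
    ScalingType="UNIFORM_SCALING",
)

# Give the consumer more memory (and therefore CPU) so larger batches
# are processed within the timeout.
lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="process-temperature-events",
    MemorySize=1024,
)
```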
56
An organization has added a clickstream to their website to analyze traffic. The website is sending each page request with the PutRecord API call to an Amazon Kinesis stream by using the page name as the partition key. During peak spikes in website traffic, a support engineer notices many events in the application logs: `ProvisionedThroughputExceededException`. What should be done to resolve the issue in the MOST cost-effective way? A. Create multiple Amazon Kinesis streams for page requests to increase the concurrency of the clickstream. B. Increase the number of shards on the Kinesis stream to allow for more throughput to meet the peak spikes in traffic. C. Modify the application to use the Kinesis Producer Library to aggregate requests before sending them to the Kinesis stream. D. Attach more consumers to the Kinesis stream to process records in parallel, improving the performance on the stream.
C The most cost-effective solution is C. The `ProvisionedThroughputExceededException` indicates that the Kinesis stream's throughput is insufficient to handle the peak traffic. While increasing shards (B) would increase throughput, it's not cost-effective because it permanently increases costs, even when traffic is low. Creating multiple streams (A) adds complexity and doesn't address the root cause of bursts in requests. Adding more consumers (D) improves processing speed but doesn't increase the amount of data the stream can accept. Using the Kinesis Producer Library (KPL) (C) allows for batching requests, significantly reducing the number of API calls and thus the load on the stream, making it the most cost-effective solution for handling peak traffic spikes. KPL also offers features like automatic retries and buffering, increasing reliability.
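The KPL itself is a Java library, so purely as an illustration of the producer-side batching idea (not the KPL API), a Python sketch with PutRecords might look like this; the stream name is a placeholder, and a production version would also inspect FailedRecordCount in the response and retry failures, which the KPL handles automatically.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
_buffer = []

def record_click(page_name: str, payload: dict) -> None:
    """Buffer click events and send them in batches instead of one PutRecord per click."""
    _buffer.append({
        "Data": json.dumps(payload).encode(),
        "PartitionKey": page_name,
    })
    if len(_buffer) >= 500:  # PutRecords accepts up to 500 records per call
        flush()

def flush() -> None:
    if _buffer:
        kinesis.put_records(StreamName="clickstream", Records=_buffer)
        _buffer.clear()
```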
57
An administrator needs to manage a large catalog of items from various external sellers. The administrator needs to determine if the items should be identified as minimally dangerous, dangerous, or highly dangerous based on their textual descriptions. The administrator already has some items with the danger attribute, but receives hundreds of new item descriptions every day without such classification. The administrator has a system that captures dangerous goods reports from customer support team or from user feedback. What is a cost-effective architecture to solve this issue? A. Build a set of regular expression rules that are based on the existing examples, and run them on the DynamoDB Streams as every new item description is added to the system. B. Build a Kinesis Streams process that captures and marks the relevant items in the dangerous goods reports using a Lambda function once more than two reports have been filed. C. Build a machine learning model to properly classify dangerous goods and run it on the DynamoDB Streams as every new item description is added to the system. D. Build a machine learning model with binary classification for dangerous goods and run it on the DynamoDB Streams as every new item description is added to the system.
C The best answer is C because the problem requires classifying items into three categories (minimally dangerous, dangerous, highly dangerous), which is a multi-class classification problem. A machine learning model is well-suited for this task, and processing the data stream from DynamoDB ensures real-time classification of new items. Option A (regular expressions) is likely insufficient for the complexity and nuance of textual descriptions related to danger levels. Option B focuses on reports rather than proactively classifying new items. Option D is incorrect because it proposes binary classification, which only allows for two categories, while the problem requires three.
58
An organization is designing a public web application and has a requirement that states all application users must be centrally authenticated before any operations are permitted. The organization will need to create a user table with fast data lookup for the application in which a user can read only his or her own data. All users already have an account with amazon.com. How can these requirements be met? A. Create an Amazon RDS Aurora table, with Amazon_ID as the primary key. The application uses amazon.com web identity federation to get a token that is used to assume an IAM role from AWS STS. Use IAM database authentication by using the rds:db-tag IAM authentication policy and GRANT Amazon RDS row-level read permission per user. B. Create an Amazon RDS Aurora table, with Amazon_ID as the primary key for each user. The application uses amazon.com web identity federation to get a token that is used to assume an IAM role. Use IAM database authentication by using rds:db-tag IAM authentication policy and GRANT Amazon RDS row-level read permission per user. C. Create an Amazon DynamoDB table, with Amazon_ID as the partition key. The application uses amazon.com web identity federation to get a token that is used to assume an IAM role from AWS STS in the Role, use IAM condition context key dynamodb:LeadingKeys with IAM substitution variables ${www.amazon.com:user_id} and allow the required DynamoDB API operations in IAM JSON policy Action element for reading the records. D. Create an Amazon DynamoDB table, with Amazon_ID as the partition key. The application uses amazon.com web identity federation to assume an IAM role from AWS STS in the Role, use IAM condition context key dynamodb:LeadingKeys with IAM substitution variables ${www.amazon.com:user_id} and allow the required DynamoDB API operations in IAM JSON policy Action element for reading the records.
C The discussion strongly suggests that option C is the correct answer. The key reasons are: * **Web Identity Federation and Tokens:** Option C explicitly mentions the use of a token obtained through Amazon.com web identity federation, which is crucial for central authentication and authorization. Options A, B, and D either lack this explicit mention or are less clear in their description. * **DynamoDB Suitability:** While RDS Aurora (options A and B) is a viable database choice, DynamoDB (options C and D) is better suited for fast data lookup, especially for a large number of users, which is implied by the public web application requirement. DynamoDB's scalability is a significant advantage. * **IAM Role and Policy:** Option C correctly outlines the use of an IAM role, leveraging the `dynamodb:LeadingKeys` context key within an IAM policy to restrict access to only the user's own data. This provides the necessary row-level security. Option D is similar but is missing the explicit token acquisition mentioned in option C. Options A and B incorrectly focus on RDS which isn't as efficient in this context. Options A and B are incorrect because they use RDS Aurora which is not as scalable or efficient for this specific use case as DynamoDB. Option D is incorrect because it misses the explicit mention of acquiring a token via web identity federation, a critical aspect for central authentication.
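A sketch of the fine-grained policy document attached to the federated role in option C; the table ARN and allowed actions are placeholders, while ${www.amazon.com:user_id} is the documented Login with Amazon substitution variable.

```python
import json

# Policy document for the role assumed via amazon.com web identity federation.
# The substitution variable resolves at request time to the federated user's
# Amazon ID, restricting reads to items whose partition key matches it.
fine_grained_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UserProfiles",
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${www.amazon.com:user_id}"]
            }
        }
    }]
}

print(json.dumps(fine_grained_policy, indent=2))
```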
59
An enterprise customer is migrating 50 TB of data to a Redshift cluster and is considering using dense storage nodes. Their query patterns involve many joins with thousands of rows. They have a limited budget and want to avoid unnecessary testing. Which approach should they use to determine the number of nodes needed in their Redshift cluster? A. Start with many small nodes. B. Start with fewer large nodes. C. Have two separate clusters with a mix of small and large nodes. D. Insist on performing multiple tests to determine the optimal configuration.
A. Start with many small nodes. The discussion indicates that starting with many smaller nodes (e.g., ds2.xlarge) is more cost-effective than starting with fewer larger nodes (e.g., ds2.8xlarge) for a 50TB dataset with high query complexity involving many joins. While fewer large nodes might seem initially appealing, the increased computational power may be wasted if it's not needed, making it a more expensive option. Option C is inefficient and unnecessary, and Option D contradicts the customer's desire to avoid testing. The provided calculations in the discussion support the selection of many smaller nodes for better cost optimization given the data volume and query characteristics.
60
A company is building a new application in AWS. The architect needs to design a system to collect application log events. The design should be a repeatable pattern that minimizes data loss if an application instance fails, and keeps a durable copy of the log data for at least 30 days. What is the simplest architecture that will allow the architect to analyze the logs? A. Write them directly to a Kinesis Firehose. Configure Kinesis Firehose to load the events into an Amazon Redshift cluster for analysis. B. Write them to a file on Amazon Simple Storage Service (S3). Write an AWS Lambda function that runs in response to the S3 event to load the events into Amazon Elasticsearch Service for analysis. C. Write them to the local disk and configure the Amazon CloudWatch Logs agent to load the data into CloudWatch Logs and subsequently into Amazon Elasticsearch Service. D. Write them to CloudWatch Logs and use an AWS Lambda function to load them into HDFS on an Amazon Elastic MapReduce (EMR) cluster for analysis.
C C is the simplest solution because it requires only configuration of the CloudWatch Logs agent and a CloudWatch Logs subscription to Amazon Elasticsearch Service. This avoids the need for custom code (unlike options B and D) and directly addresses the requirement of minimizing data loss by writing to the local disk first, before sending to the more durable CloudWatch Logs. Option A is incorrect because writing directly to Redshift is not an efficient or typical logging approach. Option B requires additional coding and is more complex than option C. Option D introduces unnecessary complexity by using EMR, which is not a storage service but a processing platform.
61
A Redshift data warehouse has different user teams that need to query the same table with very different query types. These user teams are experiencing poor performance. Which action improves performance for the user teams in this situation? A. Create custom table views. B. Add interleaved sort keys per team. C. Maintain team-specific copies of the table. D. Add support for workload management queue hopping.
D The correct answer is D because Amazon Redshift workload management (WLM) lets separate queues, each with its own memory and concurrency settings, be defined for the different teams' query types, and queue hopping routes queries that exceed a queue's timeout to the next matching queue instead of cancelling them. This keeps short interactive queries from being stuck behind long-running ones and improves overall performance for all user teams. Option A is incorrect because creating custom views doesn't inherently improve query performance; they only change how data is accessed, not the underlying query execution. Option B is incorrect because adding interleaved sort keys per team is impractical and not scalable; a table has a single sort key definition, and altering it is difficult and inefficient. Option C is incorrect because maintaining team-specific copies of the table is inefficient, wasteful of storage, and doesn't address the root cause of the performance issues. It's a brute-force solution that's generally not recommended.
62
A company operates an international business served from a single AWS region. The company wants to expand into a new country. The regulator for that country requires the Data Architect to maintain a log of financial transactions in the country within 24 hours of the product transaction. The production application is latency insensitive. The new country contains another AWS region. What is the most cost-effective way to meet this requirement? A. Use CloudFormation to replicate the production application to the new region. B. Use Amazon CloudFront to serve application content locally in the country; Amazon CloudFront logs will satisfy the requirement. C. Continue to serve customers from the existing region while using Amazon Kinesis to stream transaction data to the regulator. D. Use Amazon S3 cross-region replication (CRR) to copy and persist production transaction logs to a bucket in the new country's region.
D The most cost-effective solution is to use Amazon S3 cross-region replication (CRR) to copy transaction logs to a bucket in the new region. S3 CRR is a relatively inexpensive service, and the replication happens within a timeframe (seconds to minutes) that easily meets the 24-hour requirement. Option A is incorrect because replicating the entire application is significantly more expensive and complex than just replicating logs. Option B is incorrect because CloudFront is a content delivery network; it doesn't log financial transactions. Option C is incorrect because while Kinesis could stream the data, it would likely be more expensive than S3 CRR, and would require additional infrastructure to process and store the data in the new region to meet the regulatory requirement.
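A boto3 sketch of enabling cross-region replication on the bucket holding the transaction logs; bucket names, the prefix, and the role ARN are placeholders, and versioning is assumed to be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_replication(
    Bucket="transactions-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-transaction-logs",
            "Status": "Enabled",
            "Prefix": "transaction-logs/",
            # Destination bucket lives in the new country's region.
            "Destination": {"Bucket": "arn:aws:s3:::transactions-incountry-bucket"},
        }],
    },
)
```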
63
Multiple rows in an Amazon Redshift table were accidentally deleted. A System Administrator is restoring the table from the most recent snapshot. The snapshot contains all rows that were in the table before the deletion. What is the SIMPLEST solution to restore the table without impacting users? A. Restore the snapshot to a new Amazon Redshift cluster, then UNLOAD the table to Amazon S3. In the original cluster, TRUNCATE the table, then load the data from Amazon S3 by using a COPY command. B. Use the Restore Table from a Snapshot command and specify a new table name. DROP the original table, then RENAME the new table to the original table name. C. Restore the snapshot to a new Amazon Redshift cluster. Create a DBLINK between the two clusters in the original cluster, TRUNCATE the destination table, then use an INSERT command to copy the data from the new cluster. D. Use the ALTER TABLE REVERT command and specify a time stamp of immediately before the data deletion. Specify the Amazon Resource Name of the snapshot as the SOURCE and use the OVERWRITE REPLACE option.
B The discussion overwhelmingly supports option B as the simplest solution. Options A and C involve extra steps like unloading/loading data to S3 or creating a DBLINK and inserting data between clusters, thus increasing complexity and potential downtime. Option D, while potentially valid, is not as straightforward as directly restoring the table from a snapshot using a new table name. Option B efficiently restores the data to a new table, then replaces the old one, minimizing user impact.
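A sketch of option B with boto3, using placeholder identifiers; the final swap is ordinary SQL run against the cluster.

```python
import boto3

redshift = boto3.client("redshift")

# Restore the table from the snapshot under a temporary name.
redshift.restore_table_from_cluster_snapshot(
    ClusterIdentifier="example-cluster",
    SnapshotIdentifier="rs:example-cluster-2023-05-17-00-00-00",
    SourceDatabaseName="dw",
    SourceSchemaName="public",
    SourceTableName="orders",
    NewTableName="orders_restored",
)

# Once the restore completes, swap the tables with SQL run against the cluster:
#   DROP TABLE public.orders;
#   ALTER TABLE public.orders_restored RENAME TO orders;
```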
64
An organization is soliciting public feedback through a web portal and is using Amazon QuickSight to visualize data from an Amazon RDS database. Management wants to understand important metrics about feedback and how this feedback has changed over the last four weeks in a visual representation. What would be the MOST effective way to represent multiple iterations of an analysis in Amazon QuickSight to show how the data has changed over the last four weeks? A. Use the analysis option for data captured in each week and view the data by a date range. B. Use a pivot table as a visual option to display measured values and weekly aggregate data as a row dimension. C. Use a dashboard option to create an analysis of the data for each week and apply filters to visualize the data change. D. Use a story option to preserve multiple iterations of an analysis and play the iterations sequentially.
D The correct answer is D because a QuickSight story allows for the sequential presentation of multiple analyses, effectively showing the data's evolution over the four weeks. This directly addresses the requirement to visualize the change in feedback over time. Option A is incorrect because while it allows viewing data by date range, it doesn't inherently provide a sequential, comparative view of the changes across weeks in a single, easily digestible presentation. Option B is incorrect as a pivot table is primarily for data aggregation and summarization; it doesn't offer a built-in mechanism to visually showcase the *change* over time as effectively as a story. Option C is incorrect because a dashboard displays multiple visualizations simultaneously, but doesn't inherently provide a sequential narrative of how the data changed over the four weeks. While filters could be used, the process would be cumbersome compared to using a story.
65
A media advertising company handles a large number of real-time messages sourced from over 200 websites in real time. Processing latency must be kept low. Based on calculations, a 60-shard Amazon Kinesis stream is more than sufficient to handle the maximum data throughput, even with traffic spikes. The company also uses an Amazon Kinesis Client Library (KCL) application running on Amazon Elastic Compute Cloud (EC2) managed by an Auto Scaling group. Amazon CloudWatch indicates an average of 25% CPU and a modest level of network traffic across all running servers. The company reports a 150% to 200% increase in latency of processing messages from Amazon Kinesis during peak times. There are NO reports of delay from the sites publishing to Amazon Kinesis. What is the appropriate solution to address the latency? A. Increase the number of shards in the Amazon Kinesis stream to 80 for greater concurrency. B. Increase the size of the Amazon EC2 instances to increase network throughput. C. Increase the minimum number of instances in the Auto Scaling group. D. Increase Amazon DynamoDB throughput on the checkpoint table.
D The discussion indicates that D is the correct answer. The latency increase is happening during the processing of messages from Kinesis, not during ingestion into Kinesis. The Kinesis Client Library (KCL) application uses DynamoDB to store checkpoints. Increased throughput on the DynamoDB checkpoint table would improve the performance of the KCL application by reducing the time it takes to write checkpoints, which directly addresses the latency issue described. Option A is incorrect because the problem isn't with Kinesis's throughput; the stream already has sufficient shards. Option B is incorrect because the EC2 instances are underutilized (25% CPU), indicating that increasing their size would not improve processing speed. Option C is also incorrect because while scaling up EC2 instances could help, the problem is not a lack of capacity; it's specifically a bottleneck in checkpointing to DynamoDB.
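A minimal sketch of raising the provisioned write capacity on the KCL checkpoint table; the table name (which matches the KCL application name) and the capacity values are placeholders.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.update_table(
    TableName="kcl-clickstream-app",
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 200,  # more WCU so checkpoint writes stop throttling
    },
)
```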
66
An online photo album app has a key design feature to support multiple screens (e.g, desktop, mobile phone, and tablet) with high-quality displays. Multiple versions of the image must be saved in different resolutions and layouts. The image-processing Java program takes an average of five seconds per upload, depending on the image size and format. Each image upload captures the following image metadata: user, album, photo label, upload timestamp. The app should support the following requirements: ✑ Hundreds of user image uploads per second ✑ Maximum image upload size of 10 MB ✑ Maximum image metadata size of 1 KB ✑ Image displayed in optimized resolution in all supported screens no later than one minute after image upload Which strategy should be used to meet these requirements? A. Write images and metadata to Amazon Kinesis. Use a Kinesis Client Library (KCL) application to run the image processing and save the image output to Amazon S3 and metadata to the app repository DB. B. Write image and metadata to RDS with BLOB data type. Use AWS Data Pipeline to run the image processing and save the image output to Amazon S3 and metadata to the app repository DB. C. Upload image with metadata to Amazon S3, use Lambda function to run the image processing and save the images output to Amazon S3 and metadata to the app repository DB. D. Write image and metadata to Amazon Kinesis. Use Amazon Elastic MapReduce (EMR) with Spark Streaming to run image processing and save the images output to Amazon S3 and metadata to app repository DB.
C The discussion indicates that option C is the best solution due to the limitations of other options. Option A is incorrect because Kinesis has a 1MB record size limit, which is smaller than the maximum image size of 10MB. Option B is incorrect because using RDS for image storage is inefficient and slow for this scale of uploads. Option D is also unsuitable because EMR with Spark Streaming is an overkill for this use case and introduces unnecessary complexity and latency. Option C leverages S3 for storage, Lambda for processing, and a database for metadata, which is a common and efficient approach for handling high-volume image uploads and processing. The Lambda function processes the images asynchronously, allowing for quick uploads and meeting the one-minute display requirement.
67
A customer needs to determine the optimal distribution strategy for the ORDERS fact table in its Redshift schema. The ORDERS table has foreign key relationships with multiple dimension tables in this schema. How should the company determine the most appropriate distribution key for the ORDERS table? A. Identify the largest and most frequently joined dimension table and ensure that it and the ORDERS table both have EVEN distribution. B. Identify the largest dimension table and designate the key of this dimension table as the distribution key of the ORDERS table. C. Identify the smallest dimension table and designate the key of this dimension table as the distribution key of the ORDERS table. D. Identify the largest and the most frequently joined dimension table and designate the key of this dimension table as the distribution key of the ORDERS table.
D The correct answer is D because in a star schema, the fact table (ORDERS) is often joined with multiple dimension tables. To optimize query performance, the distribution key should be the foreign key referencing the largest and most frequently joined dimension table. This ensures data locality and reduces the amount of data that needs to be transferred between nodes during joins. Option A is incorrect because while even distribution is desirable, it's secondary to selecting the correct dimension table's key. Option B is incorrect because it only considers the size of the dimension table, not the frequency of joins. Option C is incorrect because choosing the smallest dimension table is likely to result in poor query performance due to increased data movement.
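To make the recommendation concrete, the DDL for option D might resemble the sketch below, submitted here through the Redshift Data API. The cluster name, table, and column names are hypothetical, and CUSTOMER is assumed to be the largest, most frequently joined dimension.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical star schema: ORDERS is distributed on the key of the largest,
# most frequently joined dimension (assumed here to be CUSTOMER), so matching
# rows are co-located on the same node slice during joins.
DDL = """
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  BIGINT,        -- FK to the customer dimension
    order_date   DATE,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (order_date);
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
    Database="dev",
    DbUser="admin",
    Sql=DDL,
)
```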
68
A company generates a large number of files each month and needs to use AWS Import/Export to move these files into Amazon S3 storage. To satisfy the auditors, the company needs to keep a record of which files were imported into Amazon S3. What is a low-cost way to create a unique log for each import job? A. Use the same log file prefix in the import/export manifest files to create a versioned log file in Amazon S3 for all imports. B. Use the log file prefix in the import/export manifest files to create a unique log file in Amazon S3 for each import. C. Use the log file checksum in the import/export manifest files to create a unique log file in Amazon S3 for each import. D. Use a script to iterate over files in Amazon S3 to generate a log after each import/export job.
B The correct answer is B because using a unique log file prefix for each import job ensures that each import generates a separate log file in S3. This directly addresses the requirement to keep a record of each import. Option A is incorrect because using the same prefix creates a single, potentially overwritten log file, making it impossible to track individual imports. Option C is incorrect because while checksums are unique to file content, they aren't designed for creating unique filenames for logs and may be computationally expensive. Option D is incorrect because it is not a low-cost solution and adds extra complexity compared to leveraging the built-in features of AWS Import/Export. It also means that the log is created after the import and may not be complete or reliable if an error occurred during the import itself.
69
An online gaming company uses DynamoDB to store user activity logs and is experiencing throttled writes on their DynamoDB table. The company is NOT consuming close to the provisioned capacity. The table contains a large number of items and is partitioned on user and sorted by date. The table is 200GB and is currently provisioned at 10K WCU and 20K RCU. Which two additional pieces of information are required to determine the cause of the throttling? (Choose two.) A. The structure of any GSIs that have been defined on the table B. CloudWatch data showing consumed and provisioned write capacity when writes are being throttled C. Application-level metrics showing the average item size and peak update rates for each attribute D. The structure of any LSIs that have been defined on the table
A, B
70
A data engineer in a manufacturing company is designing a data processing platform that receives a large volume of unstructured data. The data engineer must populate a well-structured star schema in Amazon Redshift. What is the most efficient architecture strategy for this purpose? A. Transform the unstructured data using Amazon EMR and generate CSV data. COPY the CSV data into the analysis schema within Redshift. B. Load the unstructured data into Redshift, and use string parsing functions to extract structured data for inserting into the analysis schema. C. When the data is saved to Amazon S3, use S3 Event Notifications and AWS Lambda to transform the file contents. Insert the data into the analysis schema on Redshift. D. Normalize the data using an AWS Marketplace ETL tool, persist the results to Amazon S3, and use AWS Lambda to INSERT the data into Redshift.
A
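The COPY step in option A is a bulk, parallel load straight from Amazon S3, which is why it is preferred over row-by-row inserts. A hedged sketch using the Redshift Data API follows; the bucket, schema, table, and IAM role names are hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# COPY the EMR-generated CSV output from S3 into the analysis schema.
# Bucket, table, and IAM role ARN below are hypothetical placeholders.
COPY_SQL = """
COPY analysis.fact_measurements
FROM 's3://etl-output-bucket/curated/csv/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV;
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="admin",
    Sql=COPY_SQL,
)
```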
71
A data engineer chooses Amazon DynamoDB as a data store for a regulated application. This application must be submitted to regulators for review. The data engineer needs to provide a control framework that lists the security controls from the process to follow to add new users down to the physical controls of the data center, including items like security guards and cameras. How should this control mapping be achieved using AWS? A. Request AWS third-party audit reports and/or the AWS quality addendum and map the AWS responsibilities to the controls that must be provided. B. Request data center Temporary Auditor access to an AWS data center to verify the control mapping. C. Request relevant SLAs and security guidelines for Amazon DynamoDB and define these guidelines within the applications architecture to map to the control framework. D. Request Amazon DynamoDB system architecture designs to determine how to map the AWS responsibilities to the control that must be provided.
A
72
A large grocery distributor receives daily depletion reports from the field in the form of gzip archives of CSV files uploaded to Amazon S3. The files range from 500MB to 5GB. These files are processed daily by an EMR job. Recently it has been observed that the file sizes vary, and the EMR jobs take too long. The distributor needs to tune and optimize the data processing workflow with this limited information to improve the performance of the EMR job. Which recommendation should an administrator provide? A. Reduce the HDFS block size to increase the number of task processors. B. Use bzip2 or Snappy rather than gzip for the archives. C. Decompress the gzip archives and store the data as CSV files. D. Use Avro rather than gzip for the archives.
B The best answer is B because the bottleneck is the processing of large gzip archives. Gzip files are not splittable, so a single 5GB archive must be decompressed and processed by a single mapper, limiting parallelism. Bzip2 archives are splittable, allowing multiple mappers to work on one file in parallel, and Snappy decompresses far faster than gzip, so either choice improves throughput for data that is processed daily. A is incorrect because reducing the HDFS block size does not change how a non-splittable gzip file is handled; one task still has to read the entire archive. C is incorrect because storing uncompressed CSV files increases storage and transfer costs significantly; while it would remove the decompression step, the cost/benefit is unfavorable compared to switching to a splittable or faster codec. D is incorrect because Avro is a serialization format rather than a compression codec; it is not a direct replacement for gzip, and converting the pipeline to Avro is a larger change than simply switching codecs, making B the more targeted solution.
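A quick way to confirm the splittability difference is to compare input partition counts in Spark on the EMR cluster. The sketch below is illustrative only; the bucket and file names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-split-check").getOrCreate()

# Hypothetical paths to the same daily depletion report in two codecs.
gzip_df = spark.read.csv("s3://depletion-reports/2023-06-01/report.csv.gz")
bzip2_df = spark.read.csv("s3://depletion-reports/2023-06-01/report.csv.bz2")

# A multi-GB gzip file typically yields a single input partition (not splittable),
# while the bzip2 copy is split across many partitions and processed in parallel.
print("gzip partitions: ", gzip_df.rdd.getNumPartitions())
print("bzip2 partitions:", bzip2_df.rdd.getNumPartitions())
```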
73
A social media customer has data from different data sources including RDS running MySQL, Redshift, and Hive on EMR. To support better analysis, the customer needs to be able to analyze data from different data sources and to combine the results. What is the most cost-effective solution to meet these requirements? A. Load all data from a different database/warehouse to S3. Use Redshift COPY command to copy data to Redshift for analysis. B. Install Presto on the EMR cluster where Hive sits. Configure MySQL and PostgreSQL connector to select from different data sources in a single query. C. Spin up an Elasticsearch cluster. Load data from all three data sources and use Kibana to analyze. D. Write a program running on a separate EC2 instance to run queries to three different systems. Aggregate the results after getting the responses from all three systems.
B The most cost-effective solution is B: Install Presto on the EMR cluster where Hive sits. Presto is a distributed SQL query engine that can connect to various data sources, including MySQL, Redshift (via its PostgreSQL connector), and Hive. Configuring connectors allows querying all three sources within a single query, eliminating the need for data movement or complex custom programming. Option A is inefficient as it involves moving large amounts of data to S3 and then to Redshift. Option C is expensive due to the cost of running an Elasticsearch cluster. Option D requires custom programming and potentially significant processing overhead on the EC2 instance. While technically feasible, these options are less cost-effective than leveraging Presto on the customer's existing EMR infrastructure.
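For illustration, Presto connectors are configured through catalog properties files on the EMR cluster. The sketch below writes two hypothetical catalogs (a MySQL connector for the RDS database and the PostgreSQL connector for Redshift); the endpoints, credentials, and catalog directory are placeholders and should be verified against the EMR release in use.

```python
from pathlib import Path

# Hypothetical catalog directory; verify the exact path for your EMR release.
CATALOG_DIR = Path("/etc/presto/conf/catalog")

CATALOGS = {
    # MySQL connector pointed at the RDS instance (endpoint and credentials are placeholders).
    "rdsmysql.properties": "\n".join([
        "connector.name=mysql",
        "connection-url=jdbc:mysql://my-rds-endpoint:3306",
        "connection-user=presto",
        "connection-password=change-me",
    ]),
    # Redshift is reached through the PostgreSQL connector.
    "redshift.properties": "\n".join([
        "connector.name=postgresql",
        "connection-url=jdbc:postgresql://my-redshift-endpoint:5439/dev",
        "connection-user=presto",
        "connection-password=change-me",
    ]),
}

for filename, body in CATALOGS.items():
    (CATALOG_DIR / filename).write_text(body + "\n")

# After restarting Presto, a single query can typically join across catalogs, e.g.:
# SELECT ... FROM hive.web.events e JOIN rdsmysql.app.users u ON e.user_id = u.id;
```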
74
A data engineer is about to perform a major upgrade to the DDL contained within an Amazon Redshift cluster to support a new data warehouse application. The upgrade scripts will include user permission updates, view and table structure changes as well as additional loading and data manipulation tasks. The data engineer must be able to restore the database to its existing state in the event of issues. Which action should be taken prior to performing this upgrade task? A. Run an UNLOAD command for all data in the warehouse and save it to S3. B. Create a manual snapshot of the Amazon Redshift cluster. C. Make a copy of the automated snapshot on the Amazon Redshift cluster. D. Call the waitForSnapshotAvailable command from either the AWS CLI or an AWS SDK.
B The correct answer is B because creating a manual snapshot provides a point-in-time backup of the entire Redshift cluster, allowing for a full restoration to the state before the upgrade in case of problems. This is a comprehensive and readily restorable backup. Option A is incorrect because unloading data to S3 only backs up the data, not the schema or cluster configuration. Restoring from S3 would require reloading the data, and recreating all the DDL, which is a time-consuming and error-prone process. Option C is incorrect because relying on an automated snapshot may not capture the most up-to-date state. Automated snapshots happen on a schedule and the exact timing might not immediately precede the upgrade, potentially leading to data loss. Option D is incorrect because this command is used to check the status of an existing snapshot, not to create one. It's useful after creating a snapshot but doesn't provide the necessary backup before the upgrade begins.
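In practice the safeguard in option B is a single API call, and a waiter can block until the snapshot is available (the behavior option D alludes to). A hedged boto3 sketch with hypothetical identifiers:

```python
import boto3

redshift = boto3.client("redshift")

# Hypothetical cluster and snapshot identifiers.
redshift.create_cluster_snapshot(
    SnapshotIdentifier="pre-ddl-upgrade-2023-06-01",
    ClusterIdentifier="analytics-cluster",
)

# Optionally block until the snapshot is ready before running the upgrade scripts.
waiter = redshift.get_waiter("snapshot_available")
waiter.wait(SnapshotIdentifier="pre-ddl-upgrade-2023-06-01")
```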
75
A company is using Amazon Machine Learning as part of a medical software application. The application will predict the most likely blood type for a patient based on a variety of other clinical tests that are available when blood type knowledge is unavailable. What is the appropriate model choice and target attribute combination for this problem? A. Multi-class classification model with a categorical target attribute. B. Regression model with a numeric target attribute. C. Binary Classification with a categorical target attribute. D. K-Nearest Neighbors model with a multi-class target attribute.
A The correct answer is A because: Blood type prediction is a multi-class classification problem (A, B, AB, O). The target attribute (blood type) is categorical. Options B and C are incorrect because blood type is not a numeric value and is not binary. Option D is incorrect because, while K-Nearest Neighbors *can* perform multi-class classification, the discussion clarifies that Amazon Machine Learning (the specified platform) does not include KNN.
76
A travel website needs to present a graphical quantitative summary of its daily bookings to website visitors for marketing purposes. The website has millions of visitors per day, but wants to control costs by implementing the least-expensive solution for this visualization. What is the most cost-effective solution? A. Generate a static graph with a transient EMR cluster daily, and store it in an Amazon S3 bucket. B. Generate a graph using MicroStrategy backed by a transient EMR cluster. C. Implement a Jupyter front-end provided by a continuously running EMR cluster leveraging spot instances for task nodes. D. Implement a Zeppelin application that runs on a long-running EMR cluster.
A The correct answer is A because it uses a transient EMR cluster, which only runs when needed to generate the daily graph, minimizing compute costs. Storing the graph in Amazon S3 allows for inexpensive serving of the static image to website visitors. Option B is incorrect because MicroStrategy is a commercial BI tool, adding significant licensing costs. Option C and D are incorrect because they involve continuously running EMR clusters, which are significantly more expensive than using transient clusters that are only active for a short period each day to generate the graph. The continuous running nature of options C and D is unnecessary for a simple daily summary graph intended for marketing purposes.
77
A customer has a machine learning workflow on Amazon EMR that involves multiple quick cycles of reads-writes-reads on Amazon S3. The customer is concerned that subsequent read cycles might miss new data written in prior cycles, impacting the machine learning process. Which approach should the customer use to ensure data consistency across these cycles? A. Turn on EMRFS consistent view when configuring the EMR cluster. B. Use AWS Data Pipeline to orchestrate the data processing cycles. C. Set `hadoop.data.consistency = true` in the core-site.xml file. D. Set `hadoop.s3.consistency = true` in the core-site.xml file.
A The correct answer is A because EMRFS consistent view is specifically designed to address the problem of inconsistent reads from Amazon S3 within a short timeframe. It ensures that subsequent reads see the data written in previous cycles within the same EMRFS session. Option B (using AWS Data Pipeline) is not the best solution for this specific problem. While Data Pipeline can orchestrate the workflow, it doesn't inherently solve the data consistency issue between quick read-write cycles. Options C and D are incorrect because they refer to Hadoop configuration properties that don't directly provide the needed S3 data consistency within an EMRFS context. They are not the recommended approach for this specific scenario.
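EMRFS consistent view is enabled at cluster creation, typically through the emrfs-site configuration classification. A hedged boto3 sketch follows; the cluster name, release label, and instance settings are placeholder values.

```python
import boto3

emr = boto3.client("emr")

# Placeholder cluster settings; the relevant part is the emrfs-site classification.
emr.run_job_flow(
    Name="ml-workflow-cluster",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent": "true",        # turn on EMRFS consistent view
                "fs.s3.consistent.retryCount": "5",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```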
78
An administrator receives about 100 files per hour into Amazon S3 and will be loading the files into Amazon Redshift. Customers who analyze the data within Redshift gain significant value when they receive data as quickly as possible. The customers have agreed to a maximum loading interval of 5 minutes. Which loading approach should the administrator use to meet this objective? A. Load each file as it arrives because getting data into the cluster as quickly as possible is the priority. B. Load the cluster as soon as the administrator has the same number of files as nodes in the cluster. C. Load the cluster when the administrator has an even multiple of files relative to Cluster Slice Count, or 5 minutes, whichever comes first. D. Load the cluster when the number of files is less than the Cluster Slice Count.
C
79
A company is centralizing a large number of unencrypted small files from multiple Amazon S3 buckets. The company needs to verify that the files contain the same data after centralization. Which method meets the requirements? A. Compare the S3 Etags from the source and destination objects. B. Call the S3 CompareObjects API for the source and destination objects. C. Place a HEAD request against the source and destination objects comparing SIG v4. D. Compare the size of the source and destination objects.
A The correct answer is A because, for small unencrypted objects uploaded in a single part, the S3 ETag is the MD5 digest of the object's contents, so matching ETags on the source and destination objects confirm that the data was not altered during centralization. Option B is incorrect because there is no S3 CompareObjects API. Option C is incorrect because a SigV4 signature only authenticates the request; it says nothing about the integrity of the stored object. Option D is incorrect because matching sizes do not guarantee identical content; two files can have the same size but different data.
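A hedged sketch of the ETag check with boto3 follows; the bucket and key names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")


def etags_match(src_bucket, src_key, dst_bucket, dst_key):
    """Compare the ETags of a source and destination object.

    For small, single-part, unencrypted uploads the ETag is the MD5 digest of
    the object body, so equal ETags imply identical content. (Multipart uploads
    produce composite ETags and would need a different comparison.)
    """
    src = s3.head_object(Bucket=src_bucket, Key=src_key)
    dst = s3.head_object(Bucket=dst_bucket, Key=dst_key)
    return src["ETag"] == dst["ETag"]


# Hypothetical bucket and key names.
print(etags_match("dept-a-logs", "2023/06/01/file.csv",
                  "central-logs", "dept-a/2023/06/01/file.csv"))
```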
80
An administrator is deploying Spark on Amazon EMR for two distinct use cases: machine learning algorithms and ad-hoc querying. All data will be stored in Amazon S3. Two separate clusters, one for each use case, will be deployed. The data volumes on Amazon S3 are less than 10 GB. How should the administrator align instance types with the clusters' purpose? A. Machine Learning on C instance types and ad-hoc queries on R instance types B. Machine Learning on R instance types and ad-hoc queries on G2 instance types C. Machine Learning on T instance types and ad-hoc queries on M instance types D. Machine Learning on D instance types and ad-hoc queries on I instance types
A The correct answer is A because: * **C instance types (Compute Optimized):** These are suitable for machine learning algorithms which often benefit from high CPU performance. While not explicitly stated as *the best*, they offer a good balance of price and performance for the described workload. * **R instance types (Memory Optimized):** These are well-suited for ad-hoc querying, where fast memory access is crucial for quick response times. The relatively small dataset size (less than 10GB) means that memory optimization would be advantageous over other instance types. Options B, C, and D are incorrect because: * **B:** While R instances are good for memory-intensive tasks (making them suitable for ad-hoc queries), G2 instances (GPU optimized) are not the best choice for this use case given that the dataset is small. * **C:** T and M instances are general purpose, offering neither the compute optimization needed for Machine Learning nor the memory optimization beneficial to ad-hoc querying. * **D:** D and I instances are storage optimized, which is not the primary concern for this scenario, where processing power and memory are more important.
81
An organization is designing an application architecture. The application will have over 100 TB of data and will support transactions that arrive at rates from hundreds per second to tens of thousands per second, depending on the day of the week and time of the day. All transaction data must be durably and reliably stored. Certain read operations must be performed with strong consistency. Which solution meets these requirements? A. Use Amazon DynamoDB as the data store and use strongly consistent reads when necessary. B. Use an Amazon Relational Database Service (RDS) instance sized to meet the maximum anticipated transaction rate and with the High Availability option enabled. C. Deploy a NoSQL data store on top of an Amazon Elastic MapReduce (EMR) cluster, and select the HDFS High Durability option. D. Use Amazon Redshift with synchronous replication to Amazon Simple Storage Service (S3) and row-level locking for strong consistency.
A The correct answer is A because: * **DynamoDB's scalability:** DynamoDB is a NoSQL database designed for high throughput and scalability, handling large datasets and high transaction rates exceeding the capabilities of RDS. The question specifies data exceeding 100TB, which is beyond the capacity limitations of RDS. * **Strong consistency:** DynamoDB supports strongly consistent reads, fulfilling the requirement for certain read operations to have this property. * **Durability and reliability:** DynamoDB offers durable and reliable storage, ensuring data persistence. Option B is incorrect because RDS instances have storage limitations and are not designed to handle datasets exceeding 100TB. Sizing for maximum transaction rate is also not a best practice; DynamoDB's auto-scaling handles variable loads more efficiently. Option C is incorrect because EMR with HDFS is primarily for batch processing and analytics, not for the high-throughput transactional workload described. While HDFS can be configured for high durability, it does not inherently provide the strong consistency required. Option D is incorrect because Amazon Redshift is a data warehouse optimized for analytical queries, not for the transactional requirements described. While synchronous replication to S3 provides durability, Redshift's architecture isn't suitable for the high transaction rates and strong consistency needs.
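For reference, strong consistency in DynamoDB is requested per read. The sketch below uses hypothetical table and key names.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("transactions")  # hypothetical table name

# Default reads are eventually consistent; ConsistentRead=True returns the most
# recently committed value for this item (at roughly double the read cost).
response = table.get_item(
    Key={"transaction_id": "txn-0001"},  # hypothetical key schema
    ConsistentRead=True,
)
print(response.get("Item"))
```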
82
A company with a support organization needs support engineers to be able to search historic cases to provide fast responses on new issues raised. The company has forwarded all support messages into an Amazon Kinesis Stream. This meets a company objective of using only managed services to reduce operational overhead. The company needs an appropriate architecture that allows support engineers to search on historic cases and find similar issues and their associated responses. Which AWS Lambda action is most appropriate? A. Ingest and index the content into an Amazon Elasticsearch domain. B. Stem and tokenize the input and store the results into Amazon ElastiCache. C. Write data as JSON into Amazon DynamoDB with primary and secondary indexes. D. Aggregate feedback in Amazon S3 using a columnar format with partitioning.
A A is the correct answer because Elasticsearch is a search engine designed for this purpose. Ingesting and indexing the data from the Kinesis stream into Elasticsearch allows for efficient searching of historical support cases. B is incorrect because ElastiCache is a caching service, not a search engine. While stemming and tokenization are preprocessing steps for search, storing the results in ElastiCache wouldn't allow for efficient searching. C is incorrect because while DynamoDB is a NoSQL database, it is not optimized for full-text search. While you can use secondary indexes, searching across large volumes of text data would be inefficient. D is incorrect because Amazon S3 is an object storage service, not designed for searching. Although columnar storage can improve query performance in some cases, S3 itself lacks the search capabilities needed for this task.
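As a rough illustration of option A, the Lambda function consumes each Kinesis batch and indexes the messages into the Amazon Elasticsearch domain. The sketch below is deliberately simplified: the endpoint and index name are hypothetical, the requests library is assumed to be packaged with the function, and request signing and access control are omitted.

```python
import base64
import json
import os

import requests  # assumes the requests library is packaged with the function

# Hypothetical Amazon Elasticsearch Service endpoint supplied via environment variable.
ES_ENDPOINT = os.environ["ES_ENDPOINT"]
INDEX_URL = f"{ES_ENDPOINT}/support-cases/_doc"  # hypothetical index name


def handler(event, context):
    """Triggered by the Kinesis stream; indexes each support message for search.

    Request signing / fine-grained access control is omitted for brevity and
    would be required in a real deployment.
    """
    for record in event["Records"]:
        message = json.loads(base64.b64decode(record["kinesis"]["data"]))
        requests.post(INDEX_URL, json=message, timeout=5)
```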
83
A company that provides economics data dashboards needs to be able to develop software to display rich, interactive, data-driven graphics that run in web browsers and leverages the full stack of web standards (HTML, SVG, and CSS). Which technology provides the most appropriate support for this requirement? A. D3.js B. IPython/Jupyter C. R Studio D. Hue
A
84
An organization would like to run analytics on their Elastic Load Balancing logs stored in Amazon S3 and join this data with other tables in Amazon S3. The users are currently using a BI tool connecting with JDBC and would like to keep using this BI tool. Which solution would result in the LEAST operational overhead? A. Trigger a Lambda function when a new log file is added to the bucket to transform and load it into Amazon Redshift. Run the VACUUM command on the Amazon Redshift cluster every night. B. Launch a long-running Amazon EMR cluster that continuously downloads and transforms new files from Amazon S3 into its HDFS storage. Use Presto to expose the data through JDBC. C. Trigger a Lambda function when a new log file is added to the bucket to transform and move it to another bucket with an optimized data structure. Use Amazon Athena to query the optimized bucket. D. Launch a transient Amazon EMR cluster every night that transforms new log files and loads them into Amazon Redshift.
C
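As a brief illustration of option C, Athena queries the optimized bucket directly and also exposes a JDBC driver, so the existing BI tool keeps working. The sketch below starts a query through the Athena API; the database, table, and result bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical database and table defined over the optimized bucket, plus a
# result location for Athena query output.
athena.start_query_execution(
    QueryString="""
        SELECT elb_name, count(*) AS requests
        FROM   logs.elb_access_optimized
        GROUP  BY elb_name
    """,
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://athena-query-results-bucket/"},
)
```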
85
An organization is setting up a data catalog and metadata management environment for their numerous data stores currently running on AWS. The data stores are composed of Amazon RDS databases, Amazon Redshift, and CSV files residing on Amazon S3. The catalog should be populated on a scheduled basis, and minimal administration is required to manage the catalog. How can this be accomplished? A. Set up Amazon DynamoDB as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the DynamoDB table. B. Use an Amazon database as the data catalog and run a scheduled AWS Lambda function that connects to data sources to populate the database. C. Use AWS Glue Data Catalog as the data catalog and schedule crawlers that connect to data sources to populate the catalog. D. Set up Apache Hive metastore on an Amazon EC2 instance and run a scheduled bash script that connects to data sources to populate the metastore.
C C is correct because AWS Glue Data Catalog is a fully managed service specifically designed for creating and managing data catalogs. Scheduling crawlers within Glue allows for automated population of the catalog from various data sources, including those listed (RDS, Redshift, and S3). This solution minimizes administration compared to options A, B, and D which require more manual setup, configuration, and maintenance. A is incorrect because while using DynamoDB and Lambda is possible, it requires significant custom development and maintenance to handle the complexities of metadata management and data discovery. B is also incorrect for similar reasons to A. Using a generic Amazon database requires custom coding and management to build a data catalog solution. D is incorrect because setting up and managing an Apache Hive metastore on EC2 requires significant operational overhead, contradicting the requirement for minimal administration. It also lacks the managed service benefits and integration capabilities of AWS Glue Data Catalog.
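A hedged sketch of option C for the S3 portion of the estate is shown below; the crawler name, IAM role, Glue database, S3 path, and schedule are hypothetical, and the RDS and Redshift sources would be added as JDBC targets backed by Glue connections.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; RDS and Redshift sources would be registered as
# JdbcTargets that reference Glue connections.
glue.create_crawler(
    Name="s3-csv-catalog-crawler",
    Role="GlueCatalogCrawlerRole",
    DatabaseName="enterprise_catalog",
    Targets={"S3Targets": [{"Path": "s3://company-data-lake/csv/"}]},
    Schedule="cron(0 2 * * ? *)",  # run nightly at 02:00 UTC
)
```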