Practice Questions - Amazon AWS Certified Big Data - Specialty Flashcards
(85 cards)
A company’s social media manager requests more staff on the weekends to handle an increase in customer contacts from a particular region. The company needs a report to visualize the trends on weekends over the past 6 months using QuickSight. How should the data be represented?
A. A line graph plotting customer contacts vs. time, with a line for each region
B. A pie chart per region plotting customer contacts per day of the week
C. A map of regions with a heatmap overlay to show the volume of customer contacts
D. A bar graph plotting region vs. volume of social media contacts
A
A web-hosting company is building a web analytics tool to capture clickstream data from all of the websites hosted within its platform and to provide near-real-time business intelligence. This entire system is built on AWS services. The web-hosting company is interested in using Amazon Kinesis to collect this data and perform sliding window analytics. What is the most reliable and fault-tolerant technique to get each website to send data to Amazon Kinesis with every click?
A. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the sessionID as a partition key and set up a loop to retry until a success response is received.
B. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis Producer Library .addRecords method.
C. Each web server buffers the requests until the count reaches 500 and sends them to Amazon Kinesis using the Amazon Kinesis PutRecord API.
D. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the exponential back-off algorithm for retries until a successful response is received.
B
The most reliable and fault-tolerant technique is to use the Amazon Kinesis Producer Library (KPL). Option B correctly identifies this. The KPL provides features like batching, retry mechanisms, rate limiting, and aggregation, all crucial for handling the high volume and potential errors inherent in clickstream data. Option A uses PutRecord, which lacks the built-in features of KPL for efficient and reliable delivery. Option C introduces unnecessary buffering that can delay near-real-time analytics. While option D improves on A with exponential backoff, it still lacks the advanced features of KPL. Therefore, B is the superior choice for reliability and fault tolerance in a near real-time system.
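For contrast with the KPL, here is a minimal sketch of option D's approach: a single PutRecord call wrapped in exponential backoff with jitter. The stream name, region, and event shape are assumptions; the KPL (a Java library) layers batching, aggregation, and this kind of retry logic on top of the same API automatically.

```python
import json
import random
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def send_click(event: dict, session_id: str, max_attempts: int = 5) -> dict:
    """Publish one clickstream event, retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="clickstream",           # placeholder stream name
                Data=json.dumps(event).encode(),
                PartitionKey=session_id,            # session ID spreads events across shards
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Sleep 2^attempt * 100 ms plus jitter before retrying.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("record not accepted after retries")
```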
A company hosts a portfolio of e-commerce websites across the Oregon, N. Virginia, Ireland, and Sydney AWS regions. Each site keeps log files that capture user behavior. The company has built an application that generates batches of product recommendations with collaborative filtering in Oregon. Oregon was selected because the flagship site is hosted there and provides the largest collection of data to train machine learning models against. The other regions DO NOT have enough historic data to train accurate machine learning models. Which set of data processing steps improves recommendations for each region?
A. Use the e-commerce application in Oregon to write replica log files in each other region.
B. Use Amazon S3 bucket replication to consolidate log entries and build a single model in Oregon.
C. Use Kinesis as a buffer for web logs and replicate logs to the Kinesis stream of a neighboring region.
D. Use the CloudWatch Logs agent to consolidate logs into a single CloudWatch Logs group.
A
The best solution is A. The problem states that other regions lack sufficient data to train their own models. Replicating the Oregon log files to other regions (A) provides them with the data necessary to build and improve their regional recommendation models.
Option B is incorrect because consolidating all logs into a single model in Oregon doesn’t improve regional recommendations; it creates a single, potentially less accurate model for all regions. Option C is not the most efficient approach because it only replicates logs to neighboring regions, not to all regions. Option D is incorrect because cross-region consolidation into a single CloudWatch Logs group was not available at the time the question was written.
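If the e-commerce application cannot easily be modified to write to multiple regions directly, the same effect can be approximated outside the application; a minimal boto3 sketch under assumed bucket names and regions:

```python
import boto3

SOURCE_BUCKET = "ecommerce-logs-us-west-2"            # placeholder bucket names
REPLICA_BUCKETS = {
    "us-east-1": "ecommerce-logs-us-east-1",
    "eu-west-1": "ecommerce-logs-eu-west-1",
    "ap-southeast-2": "ecommerce-logs-ap-southeast-2",
}

def replicate_log(key: str) -> None:
    """Copy one Oregon log object into the log bucket of every other region."""
    for region, bucket in REPLICA_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
```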
An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
A. When the tables are highly denormalized and do NOT participate in frequent joins.
B. When data must be grouped based on a specific key on a defined slice.
C. When data transfer between nodes must be eliminated.
D. When a new table has been loaded and it is unclear how it will be joined to dimension tables.
A, D
The correct answers are A and D. EVEN distribution in Redshift distributes rows across slices in a round-robin fashion, regardless of column values. This is ideal in two scenarios:
A. When tables are highly denormalized and do not participate in frequent joins: Since data doesn’t need to be grouped based on a specific key for efficient joins, EVEN distribution avoids unnecessary data movement.
D. When a new table is loaded and its join behavior is unclear: Using EVEN distribution provides a neutral starting point until the table’s join patterns are better understood and a more optimized distribution style (KEY or ALL) can be chosen.
Option B is incorrect because it describes KEY distribution, where data is grouped based on a specific key. Option C is incorrect because eliminating data transfer between nodes is the goal of KEY distribution, not EVEN distribution.
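Distribution style is declared in the table DDL. A brief sketch showing how EVEN, KEY, and ALL are expressed; the connection details, table names, and columns are placeholders, and psycopg2 is used here because Redshift speaks the PostgreSQL wire protocol.

```python
import psycopg2  # Amazon Redshift is reachable over the PostgreSQL wire protocol

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="example",
)

DDL_STATEMENTS = [
    # Newly loaded table with unknown join patterns: start with EVEN (round-robin).
    """CREATE TABLE staging_events (
           event_id BIGINT,
           payload  VARCHAR(4096)
       ) DISTSTYLE EVEN""",
    # Fact table joined on customer_id: KEY distribution co-locates matching rows.
    """CREATE TABLE sales (
           customer_id BIGINT,
           amount      DECIMAL(12,2)
       ) DISTSTYLE KEY DISTKEY (customer_id)""",
    # Small, slowly changing dimension table: ALL keeps a full copy on every node.
    """CREATE TABLE dim_region (
           region_id   INT,
           region_name VARCHAR(64)
       ) DISTSTYLE ALL""",
]

with conn, conn.cursor() as cur:
    for ddl in DDL_STATEMENTS:
        cur.execute(ddl)
```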
An organization needs a data store to handle the following data types and access patterns:
✑ Faceting
✑ Search
✑ Flexible schema (JSON) and fixed schema
✑ Noise word elimination
Which data store should the organization choose?
A. Amazon Relational Database Service (RDS)
B. Amazon Redshift
C. Amazon DynamoDB
D. Amazon Elasticsearch Service
D. Amazon Elasticsearch Service
The correct answer is D because Amazon Elasticsearch Service (Amazon ES) is a fully managed service that offers all the features required by the organization. It supports faceting and search natively. It can handle both flexible (JSON) and fixed schemas. Finally, noise word elimination is a common feature in search engines like Amazon ES.
Option A (Amazon RDS) is primarily a relational database, not ideal for flexible schemas or faceting/search functionalities. Option B (Amazon Redshift) is a data warehouse optimized for analytical queries, not real-time search and faceting. Option C (Amazon DynamoDB) is a NoSQL key-value and document database; while it can handle flexible schemas, it is not designed for complex search and faceting operations.
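A brief sketch of how those requirements map onto an Amazon ES domain using the elasticsearch Python client; the endpoint, index name, fields, and analyzer configuration are assumptions, and authentication/request signing is omitted:

```python
from elasticsearch import Elasticsearch

# Placeholder Amazon ES domain endpoint.
es = Elasticsearch("https://search-mydomain.us-west-2.es.amazonaws.com")

# Fixed schema for two fields plus a stop-word filter for noise word elimination;
# any additional JSON fields are accepted dynamically (flexible schema).
es.indices.create(
    index="catalog",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "no_noise": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],  # "stop" drops noise words
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "no_noise"},
                "category": {"type": "keyword"},
            }
        },
    },
)

# Full-text search combined with a terms aggregation, i.e. facet counts per category.
results = es.search(
    index="catalog",
    body={
        "query": {"match": {"title": "wireless headphones"}},
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    },
)
```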
An Amazon Redshift Database is encrypted using KMS. A data engineer needs to use the AWS CLI to create a KMS encrypted snapshot of the database in another AWS region. Which three steps should the data engineer take to accomplish this task? (Choose three.)
A. Create a new KMS key in the destination region.
B. Copy the existing KMS key to the destination region.
C. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key from the source region.
D. In the source region, enable cross-region replication and specify the name of the copy grant created.
E. In the destination region, enable cross-region replication and specify the name of the copy grant created.
A, C, D
The correct answers are A, C, and D. KMS keys cannot be copied between AWS regions, so a new KMS key must be created in the destination region (A). A snapshot copy grant is then created so that Amazon Redshift can use that key to encrypt the copied snapshots (C). Finally, cross-region snapshot copy is enabled on the cluster in the source region, specifying the name of the copy grant (D). Option B is incorrect because an existing KMS key cannot be copied to another region. Option E is incorrect because cross-region copying is configured in the source region, not the destination.
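A sketch of the three steps using boto3 (the question mentions the AWS CLI, which exposes the same operations); the region names, grant name, and cluster identifier are placeholders:

```python
import boto3

SOURCE_REGION = "us-west-2"   # region hosting the encrypted cluster (placeholder)
DEST_REGION = "us-east-1"     # region that will receive the snapshot copies (placeholder)

# Step A: create a KMS key in the destination region (keys cannot be copied across regions).
kms = boto3.client("kms", region_name=DEST_REGION)
key_id = kms.create_key(Description="Redshift snapshot copy key")["KeyMetadata"]["KeyId"]

# Step C: create a snapshot copy grant in the destination region so Amazon Redshift
# can use that key to encrypt the copied snapshots.
boto3.client("redshift", region_name=DEST_REGION).create_snapshot_copy_grant(
    SnapshotCopyGrantName="my-copy-grant",            # placeholder grant name
    KmsKeyId=key_id,
)

# Step D: in the source region, enable cross-region snapshot copy and reference the grant.
boto3.client("redshift", region_name=SOURCE_REGION).enable_snapshot_copy(
    ClusterIdentifier="my-encrypted-cluster",         # placeholder cluster identifier
    DestinationRegion=DEST_REGION,
    RetentionPeriod=7,
    SnapshotCopyGrantName="my-copy-grant",
)
```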
An Amazon EMR cluster using EMRFS has access to petabytes of data on Amazon S3, originating from multiple unique data sources. The customer needs to query common fields across some of the data sets to be able to perform interactive joins and then display results quickly. Which technology is most appropriate to enable this capability?
A. Presto
B. MicroStrategy
C. Pig
D. R Studio
A. Presto
Presto is the most appropriate technology because it is designed for interactive querying and fast joins on large datasets. The question specifically mentions the need for “interactive joins and then display results quickly,” which aligns perfectly with Presto’s capabilities. Pig is better suited for batch processing, MicroStrategy is a business intelligence tool, and R Studio is primarily for statistical computing and data analysis, making them less suitable for this specific use case requiring fast interactive querying of petabytes of data.
A large oil and gas company needs to provide near real-time alerts when peak thresholds are exceeded in its pipeline system. The company has developed a system to capture pipeline metrics such as flow rate, pressure, and temperature using millions of sensors. The sensors deliver their data to AWS IoT. What is a cost-effective way to provide near real-time alerts on the pipeline metrics?
A. Create an AWS IoT rule to generate an Amazon SNS notification.
B. Store the data points in an Amazon DynamoDB table and poll it for peak metrics data from an Amazon EC2 application.
C. Create an Amazon Machine Learning model and invoke it with AWS Lambda.
D. Use Amazon Kinesis Streams and a KCL-based application deployed on AWS Elastic Beanstalk.
A
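An AWS IoT rule evaluates each incoming message with a SQL statement and can invoke an Amazon SNS action directly, with no servers to run or poll, which is what makes option A the cost-effective choice. A minimal sketch with boto3; the topic filter, threshold, ARNs, and rule name are placeholders:

```python
import boto3

iot = boto3.client("iot", region_name="us-west-2")    # assumed region

iot.create_topic_rule(
    ruleName="pipeline_pressure_alert",               # placeholder rule name
    topicRulePayload={
        # Fire only when a reported pressure reading exceeds the peak threshold.
        "sql": "SELECT * FROM 'pipeline/+/metrics' WHERE pressure > 300",
        "actions": [
            {
                "sns": {
                    "targetArn": "arn:aws:sns:us-west-2:123456789012:pipeline-alerts",
                    "roleArn": "arn:aws:iam::123456789012:role/iot-sns-publish",
                    "messageFormat": "JSON",
                }
            }
        ],
    },
)
```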
A systems engineer for a company proposes the digitization and backup of large customer archives. The systems engineer needs to provide users with secure storage that ensures data can never be tampered with once it has been uploaded. How should this be accomplished?
A. Create an Amazon Glacier Vault. Specify a “Deny” Vault Lock policy on this Vault to block “glacier:DeleteArchive”.
B. Create an Amazon S3 bucket. Specify a “Deny” bucket policy on this bucket to block “s3:DeleteObject”.
C. Create an Amazon Glacier Vault. Specify a “Deny” vault access policy on this Vault to block “glacier:DeleteArchive”.
D. Create secondary AWS Account containing an Amazon S3 bucket. Grant “s3:PutObject” to the primary account.
A
The correct answer is A because a Vault Lock policy, unlike a Vault Access policy, cannot be modified after it’s locked. This ensures that the “glacier:DeleteArchive” action is permanently blocked, preventing any tampering with the data after upload. Option B is incorrect because S3 bucket policies, even with a “Deny” setting, can be changed. Option C is incorrect because a vault access policy can be modified, allowing for potential future tampering. Option D is incorrect as it doesn’t inherently prevent data tampering after upload; it only controls who can upload data.
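A sketch of initiating and completing a Vault Lock policy that denies glacier:DeleteArchive, using boto3; the vault name and account ID are placeholders. After complete_vault_lock, the policy can no longer be changed or removed.

```python
import json

import boto3

glacier = boto3.client("glacier", region_name="us-west-2")   # assumed region

lock_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "deny-archive-deletion",
        "Principal": "*",
        "Effect": "Deny",
        "Action": "glacier:DeleteArchive",
        "Resource": "arn:aws:glacier:us-west-2:123456789012:vaults/customer-archives",
    }],
}

# Initiating the lock returns a lock ID and starts a 24-hour window in which the
# policy can still be tested or aborted.
resp = glacier.initiate_vault_lock(
    accountId="-",                                    # "-" means the calling account
    vaultName="customer-archives",                    # placeholder vault name
    policy={"Policy": json.dumps(lock_policy)},
)

# Completing the lock makes the policy immutable.
glacier.complete_vault_lock(
    accountId="-",
    vaultName="customer-archives",
    lockId=resp["lockId"],
)
```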
A clinical trial will rely on medical sensors to remotely assess patient health. Each participating physician requires visual reports each morning. These reports are built from aggregations of all sensor data collected each minute. What is the most cost-effective solution for creating this daily visualization?
A. Use Kinesis Aggregators Library to generate reports for reviewing the patient sensor data and generate a QuickSight visualization on the new data each morning for the physician to review.
B. Use a transient EMR cluster that shuts down after use to aggregate the patient sensor data each night and generate a QuickSight visualization on the new data each morning for the physician to review.
C. Use Spark Streaming on EMR to aggregate the patient sensor data every 15 minutes and generate a QuickSight visualization on the new data each morning for the physician to review.
D. Use an EMR cluster to aggregate the patient sensor data each night and provide Zeppelin notebooks that look at the new data residing on the cluster each morning for the physician to review.
B
The most cost-effective solution is B. Using a transient EMR cluster ensures that resources are only consumed during the nightly aggregation process. The cluster shuts down afterward, minimizing costs compared to continuously running clusters (A, C, and D). QuickSight is a cost-effective visualization tool suitable for generating daily reports for multiple physicians. Option A might be less cost effective if the Kinesis data volume is large. Options C and D involve continuously running resources (Spark streaming or a persistent EMR cluster), making them more expensive than a transient cluster. Option D also uses Zeppelin notebooks, which are less efficient for generating visualizations compared to QuickSight for multiple users.
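A sketch of launching the nightly transient cluster with boto3: setting KeepJobFlowAliveWhenNoSteps to False makes the cluster terminate itself once its last step finishes. The release label, instance types, script location, and role names are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

emr.run_job_flow(
    Name="nightly-sensor-aggregation",
    ReleaseLabel="emr-5.36.0",                        # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behavior: shut the cluster down when the last step completes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-sensor-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate_sensors.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```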
The department of transportation for a major metropolitan area has placed sensors on roads at key locations around the city. The goal is to analyze the flow of traffic and notifications from emergency services to identify potential issues and to help planners respond to issues within 30 seconds of their occurrence. Which solution should a data engineer choose to create a scalable and fault-tolerant solution that meets this requirement?
A. Collect the sensor data with Amazon Kinesis Firehose and store it in Amazon Redshift for analysis. Collect emergency services events with Amazon SQS and store in Amazon DynamoDB for analysis.
B. Collect the sensor data with Amazon SQS and store in Amazon DynamoDB for analysis. Collect emergency services events with Amazon Kinesis Firehose and store in Amazon Redshift for analysis.
C. Collect both sensor data and emergency services events with Amazon Kinesis Streams and use DynamoDB for analysis.
D. Collect both sensor data and emergency services events with Amazon Kinesis Firehose and use Amazon Redshift for analysis.
A
An administrator needs to design a distribution strategy for a star schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which three circumstances would choosing Key-based distribution be most appropriate? (Select three.)
A. When the administrator needs to optimize a large, slowly changing dimension table.
B. When the administrator needs to reduce cross-node traffic.
C. When the administrator needs to optimize the fact table for parity with the number of slices.
D. When the administrator needs to balance data distribution and collocation of data.
E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates.
B, D, E
B, D, and E are the circumstances in which Key-based distribution is most appropriate:
- B. When the administrator needs to reduce cross-node traffic: Key-based distribution minimizes data movement across nodes during joins by ensuring related data resides on the same node. This is a primary benefit of this distribution style.
- D. When the administrator needs to balance data distribution and collocation of data: Key-based distribution helps balance data distribution while ensuring data needed for joins is co-located.
- E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates: This directly relates to the reduced cross-node traffic mentioned in B, making this a key advantage of key-based distribution.
Option A is incorrect because large, slowly changing dimension tables are often better suited to ALL distribution. Option C is incorrect because parity with the number of slices is generally associated with even distribution, not key-based distribution.
A company receives data sets coming from external providers on Amazon S3. Data sets from different providers are dependent on one another. Data sets will arrive at different times and in no particular order. A data architect needs to design a solution that enables the company to do the following:
✑ Rapidly perform cross data set analysis as soon as the data becomes available
✑ Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets these requirements?
A. Maintain data dependency information in Amazon RDS for MySQL. Use an AWS Data Pipeline job to load an Amazon EMR Hive table based on task dependencies and event notification triggers in Amazon S3.
B. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon SNS and event notifications to publish data to a fleet of Amazon EC2 workers. Once the task dependencies have been resolved, process the data with Amazon EMR.
C. Maintain data dependency information in an Amazon ElastiCache Redis cluster. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to Redis. Once the task dependencies have been resolved, process the data with Amazon EMR.
D. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to the task associated with it in DynamoDB. Once all task dependencies have been resolved, process the data with Amazon EMR.
D
Option D is the most scalable and cost-effective choice. DynamoDB is well suited for storing the configuration data that represents task dependencies, offering better scalability than Redis (option C) and RDS (option A). Option B also uses DynamoDB, but relies on SNS and a fleet of Amazon EC2 workers, which is less efficient and potentially more costly than triggering AWS Lambda functions from S3 event notifications as in option D.
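A sketch of the Lambda handler at the heart of option D, under assumed table and attribute names: each S3 event notification marks the corresponding object as arrived in DynamoDB, and once no dependencies remain pending the handler submits the EMR step.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")

# Assumed table: partition key "task_id", a string set "pending" of object keys still
# outstanding, and "cluster_id" naming the EMR cluster that runs the analysis.
tasks = dynamodb.Table("dataset-dependencies")

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        task_id = key.split("/")[0]        # assumed: the provider prefix identifies the task

        # Atomically remove this object from the task's pending set.
        item = tasks.update_item(
            Key={"task_id": task_id},
            UpdateExpression="DELETE pending :k",
            ExpressionAttributeValues={":k": {key}},
            ReturnValues="ALL_NEW",
        )["Attributes"]

        # DynamoDB drops the attribute once the set is empty, so a missing "pending"
        # means every dependency has arrived and the analysis can start.
        if not item.get("pending"):
            emr.add_job_flow_steps(
                JobFlowId=item["cluster_id"],
                Steps=[{
                    "Name": f"analyze-{task_id}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit",
                                 "s3://my-bucket/jobs/cross_dataset.py", task_id],
                    },
                }],
            )
```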
A game company uses DynamoDB to support its game application. Amazon Redshift stores the past two years of historical data. Game traffic fluctuates throughout the year due to factors like seasons, movie releases, and holidays. An administrator needs to predict the required DynamoDB read and write throughput (RCU and WCU) for each week in advance. Which approach should the administrator use?
A. Feed the data into Amazon Machine Learning and build a regression model.
B. Feed the data into Spark MLlib and build a random forest model.
C. Feed the data into Apache Mahout and build a multi-classification model.
D. Feed the data into Amazon Machine Learning and build a binary classification model.
A
The best approach is A because Redshift contains two years of historical data, which can be used as labeled data to train a regression model. A regression model is suitable for predicting a continuous value, such as the required RCU and WCU. Options B and C are less suitable because they are focused on classification problems, not regression. Option D is also less suitable because a binary classification model would only predict whether the throughput is above or below a certain threshold, not the precise amount needed.
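The (now legacy) Amazon Machine Learning API makes the distinction explicit: a REGRESSION model predicts a numeric target such as the coming week's RCU or WCU. A minimal sketch with boto3, with the model ID and the training datasource (built from the Redshift history) assumed to exist:

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")   # assumed region

ml.create_ml_model(
    MLModelId="ddb-throughput-model",                 # placeholder identifiers
    MLModelName="Weekly RCU/WCU forecast",
    MLModelType="REGRESSION",      # numeric target, unlike BINARY or MULTICLASS models
    TrainingDataSourceId="redshift-throughput-training-ds",
)
```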
A customer is collecting clickstream data using Amazon Kinesis and is grouping the events by IP address into 5-minute chunks stored in Amazon S3. Many analysts in the company use Hive on Amazon EMR to analyze this data. Their queries always reference a single IP address. Data must be optimized for querying based on IP address using Hive running on Amazon EMR. What is the most efficient method to query the data with Hive?
A. Store an index of the files by IP address in the Amazon DynamoDB metadata store for EMRFS.
B. Store the Amazon S3 objects with the following naming scheme: bucket_name/source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
C. Store the data in an HBase table with the IP address as the row key.
D. Store the events for an IP address as a single file in Amazon S3 and add metadata with keys: Hive_Partitioned_IPAddress.
B
The most efficient method is to leverage Hive’s partitioning capabilities in S3. Option B uses a naming scheme that creates partitions based on the IP address and date/time. This allows Hive to quickly locate the relevant data based on the IP address in the query, avoiding a full table scan.
Option A is incorrect because while DynamoDB can store metadata, it’s not directly integrated with Hive’s partitioning functionality in the same way that S3 is. Option C is incorrect because HBase is a NoSQL database, not optimized for Hive queries. Option D is incorrect because while metadata is helpful, it doesn’t improve query efficiency as much as partitioning the data in S3 directly via the file naming convention. Combining all events for a single IP address into one file doesn’t improve performance because Hive would still need to scan that single (potentially very large) file.
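A sketch of the writer side of option B under assumed bucket and field names: each 5-minute chunk is stored under a Hive-style key that encodes the IP address and timestamp as partition columns, so a query filtering on source = '10.1.2.3' prunes everything else. On the Hive side, the table would be declared with PARTITIONED BY (source STRING, year STRING, month STRING, day STRING, hour STRING) and new partitions registered with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def store_chunk(ip_address: str, events: bytes) -> str:
    """Write a 5-minute chunk of events under a Hive-partitioned S3 key."""
    now = datetime.now(timezone.utc)
    key = (
        f"source={ip_address}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/hour={now:%H}/"
        f"events-{now:%M}.json"
    )
    s3.put_object(Bucket="clickstream-archive", Key=key, Body=events)  # placeholder bucket
    return key
```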
A system needs to collect on-premises application spool files into a persistent storage layer in AWS. Each spool file is 2 KB. The application generates 1 million files per hour. Each source file is automatically deleted from the local server after an hour. What is the most cost-efficient option to meet these requirements?
A. Write file contents to an Amazon DynamoDB table.
B. Copy files to Amazon S3 Standard Storage.
C. Write file contents to Amazon ElastiCache.
D. Copy files to Amazon S3 Infrequent Access storage.
A
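The deciding factor is the per-request cost of landing a million 2 KB files every hour. A rough comparison of ingest costs only, with illustrative prices that should be checked against current pricing, shows why writing the contents to DynamoDB (option A) comes out well ahead of S3 PUT requests (options B and D):

```python
# Back-of-the-envelope ingest cost comparison (illustrative prices, not authoritative).
FILES_PER_HOUR = 1_000_000
HOURS_PER_MONTH = 730

S3_PUT_PER_1000 = 0.005       # assumed USD per 1,000 S3 PUT requests
DDB_WCU_PER_HOUR = 0.00065    # assumed USD per provisioned write capacity unit per hour

# S3: every 2 KB file is a separate PUT request.
s3_monthly = FILES_PER_HOUR * HOURS_PER_MONTH / 1000 * S3_PUT_PER_1000

# DynamoDB: a 2 KB item consumes 2 WCUs; ~278 writes per second sustained.
wcu_needed = FILES_PER_HOUR / 3600 * 2
ddb_monthly = wcu_needed * DDB_WCU_PER_HOUR * HOURS_PER_MONTH

print(f"S3 PUT requests:       ~${s3_monthly:,.0f}/month")   # roughly $3,650
print(f"DynamoDB write units:  ~${ddb_monthly:,.0f}/month")  # roughly $260
```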
A company that manufactures and sells smart air conditioning units also offers add-on services so that customers can see real-time dashboards in a mobile application or a web browser. Each unit sends its sensor information in JSON format every two seconds for processing and analysis. The company also needs to consume this data to predict possible equipment problems before they occur. A few thousand pre-purchased units will be delivered in the next couple of months. The company expects high market growth in the next year and needs to handle a massive amount of data and scale without interruption. Which ingestion solution should the company use?
A. Write sensor data records to Amazon Kinesis Streams. Process the data using KCL applications for the end-consumer dashboard and anomaly detection workflows.
B. Batch sensor data to Amazon Simple Storage Service (S3) every 15 minutes. Flow the data downstream to the end-consumer dashboard and to the anomaly detection application.
C. Write sensor data records to Amazon Kinesis Firehose with Amazon Simple Storage Service (S3) as the destination. Consume the data with a KCL application for the end-consumer dashboard and anomaly detection.
D. Write sensor data records to Amazon Relational Database Service (RDS). Build both the end-consumer dashboard and anomaly detection application on top of Amazon RDS.
A
Option A, using Kinesis Streams with KCL applications, is the best fit for real-time processing and massive scaling, although resharding a stream does require some operational care. Options B, C, and D are clearly inferior: B is unsuitable for real-time dashboards because of the 15-minute batching; C incorrectly assumes a KCL application can consume from Kinesis Firehose; and D is unsuitable because RDS is not designed for the volume and velocity of data involved and does not scale efficiently for this use case. Because the question emphasizes real-time processing and scaling without interruption, A is the closest fit despite the resharding consideration.
A media advertising company handles a large number of real-time messages sourced from over 200 websites. The company’s data engineer needs to collect and process records in real time for analysis using Spark Streaming on Amazon Elastic MapReduce (EMR). The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority. Which Amazon Kinesis configuration meets these requirements?
A. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Pull messages off Firehose with Spark Streaming in parallel to persistence to Amazon S3.
B. Publish messages to Amazon Kinesis Streams. Pull messages off Streams with Spark Streaming in parallel to AWS Lambda pushing messages from Streams to Firehose backed by Amazon Simple Storage Service (S3).
C. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Use AWS Lambda to pull messages from Firehose to Streams for processing with Spark Streaming.
D. Publish messages to Amazon Kinesis Streams, pull messages off with Spark Streaming, and write raw data to Amazon Simple Storage Service (S3) before and after processing.
D
Option D is the most suitable because it allows Spark Streaming to read directly from Kinesis Streams, fulfilling the real-time processing requirement. The raw data is written to S3 both before and after processing, ensuring that all raw messages are preserved as mandated.
Option A is incorrect because Spark Streaming cannot directly read from Kinesis Firehose. Option B is inefficient and adds unnecessary complexity by using Lambda as an intermediary. Option C is also incorrect because it introduces a delay by routing data through Firehose and Lambda before reaching Spark Streaming, violating the real-time requirement and potentially leading to data loss if the process isn’t perfectly reliable.
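A sketch of option D's consumer under Spark 2.x with the spark-streaming-kinesis-asl package available on EMR; the stream name, region, and bucket are placeholders, and the exact API differs between Spark versions. Each micro-batch is persisted to S3 as raw data before any transformation, and processed output would be written to S3 as well.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import InitialPositionInStream, KinesisUtils

sc = SparkContext(appName="ad-message-processing")
ssc = StreamingContext(sc, 10)                        # 10-second micro-batches

raw = KinesisUtils.createStream(
    ssc,
    "ad-message-processing",                          # KCL application / checkpoint table name
    "ad-messages",                                    # placeholder stream name
    "https://kinesis.us-east-1.amazonaws.com",        # endpoint URL
    "us-east-1",                                      # region
    InitialPositionInStream.LATEST,
    10,                                               # checkpoint interval (seconds)
    StorageLevel.MEMORY_AND_DISK_2,
)

# Corporate mandate: keep ALL raw messages, so persist each batch before processing.
raw.saveAsTextFiles("s3://ad-raw-messages/incoming/batch")

# ... real-time transformations/aggregations would follow here, with results written
# back to S3 (the "after processing" copy).

ssc.start()
ssc.awaitTermination()
```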
A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage. Which AWS service strategy is best for this use case?
A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
B
The best option is B because Amazon EMR (Elastic MapReduce) is designed for processing massive datasets like the 5 PB of email data described. EMR allows for parallel processing using Hadoop or Spark, which are ideal for distributing the computational load across a cluster. This efficiently handles the scale of the data and the computational demands of the text analysis algorithm.
Option A is incorrect because ElastiCache is an in-memory data store, not designed for processing 5 PB of data. Option C is incorrect because Elasticsearch, while useful for indexing and searching, is not ideal for processing this scale of data; it has storage limitations. Option D is incorrect because while Data Pipeline can manage jobs, it doesn’t inherently provide the parallel processing capabilities necessary for efficiently handling such a large dataset. Directly processing the data in S3 using a single Python job would be extremely slow and inefficient.
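A sketch of running the Python algorithm as a Hadoop streaming step on an existing EMR cluster via boto3; the cluster ID, script locations, and bucket names are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                      # placeholder cluster ID
    Steps=[{
        "Name": "spam-classification",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/code/spam_mapper.py,s3://my-bucket/code/spam_reducer.py",
                "-mapper", "spam_mapper.py",
                "-reducer", "spam_reducer.py",
                "-input", "s3://my-bucket/emails/",
                "-output", "s3://my-bucket/spam-results/",
            ],
        },
    }],
)
```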
An administrator needs to design the event log storage architecture for events from mobile devices. The event data will be processed by an Amazon EMR cluster daily for aggregated reporting and analytics before being archived. How should the administrator recommend storing the log data?
A. Create an Amazon S3 bucket and write log data into folders by device. Execute the EMR job on the device folders.
B. Create an Amazon DynamoDB table partitioned on the device and sorted on date, write log data to table. Execute the EMR job on the Amazon DynamoDB table.
C. Create an Amazon S3 bucket and write data into folders by day. Execute the EMR job on the daily folder.
D. Create an Amazon DynamoDB table partitioned on EventID, write log data to table. Execute the EMR job on the table.
C
The best approach is to store the log data in an Amazon S3 bucket, organized into folders by day (C). This aligns with the daily processing requirement of the EMR cluster. EMR is efficient at processing many objects from a single S3 folder in parallel, making daily aggregation straightforward. Options A, B, and D are less efficient: A would require many separate EMR jobs because processing is organized by device rather than by day; B and D use DynamoDB, which is designed for low-latency reads and writes rather than the batch processing this scenario requires, and costs more than S3 for this workload. Finer-grained partitioning is not necessary, because EMR handles parallel processing of the many objects within each daily folder.
There are thousands of text files on Amazon S3. The total size of the files is 1 PB. The files contain retail order information for the past 2 years. A data engineer needs to run multiple interactive queries to manipulate this data. The Data Engineer has AWS access to spin up an Amazon EMR cluster. The data engineer needs to use an application on the cluster to process this data and return the results in an interactive timeframe. Which application on the cluster should the data engineer use?
A. Oozie
B. Apache Pig with Tachyon
C. Apache Hive
D. Presto
D
The best answer is Presto because the question emphasizes the need for interactive query processing. Presto is designed for fast query execution and low latency, making it ideal for interactive analysis of large datasets.
While Apache Hive is suitable for large-scale batch processing, it’s not optimized for interactive queries. Oozie is a workflow scheduler, not a query engine. Apache Pig with Tachyon is better suited for ETL (Extract, Transform, Load) processes and data manipulation rather than interactive querying, although Tachyon’s in-memory capabilities could improve performance. The primary requirement here is interactive querying, which makes Presto the most appropriate choice.
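Analysts can point a SQL client at the cluster's Presto coordinator, which listens on port 8889 by default on EMR; a sketch using the PyHive client, with the host, schema, and table names as placeholders:

```python
from pyhive import presto

# Placeholder master-node DNS name; 8889 is Presto's default port on EMR.
conn = presto.connect(host="ec2-xx-xx-xx-xx.compute-1.amazonaws.com", port=8889)
cur = conn.cursor()

# Interactive aggregation over a Hive-defined table of retail orders (placeholder schema).
cur.execute("""
    SELECT order_date, SUM(total_amount) AS daily_revenue
    FROM hive.retail.orders
    WHERE order_date BETWEEN DATE '2016-01-01' AND DATE '2016-12-31'
    GROUP BY order_date
    ORDER BY order_date
""")
for row in cur.fetchall():
    print(row)
```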
Company A operates in Country X and maintains a large dataset of historical purchase orders containing personal data (full names and telephone numbers) in five 1TB text files. This data is stored on-premises due to legal requirements. The R&D department needs to run a clustering algorithm using Amazon EMR in the closest AWS region, with a minimum latency of 200ms between the on-premises system and the AWS region. Which option allows Company A to perform clustering in the AWS Cloud while complying with the legal requirement of keeping personal data within Country X?
A. Anonymize the personal data, transfer the files to Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
B. Establish a Direct Connect link between the on-premises system and the AWS region to reduce latency, and have the EMR cluster read the data directly from the on-premises storage system over Direct Connect.
C. Encrypt the data files according to Country X’s encryption standards, store them in Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
D. Use AWS Import/Export Snowball to transfer the data to the AWS region, copy the files to an EBS volume, and have the EMR cluster read the data using EMRFS.
B
The correct answer is B because it is the only option that keeps the data on-premises, thereby fulfilling the legal requirement of storing personal data in Country X. Options A, C, and D all involve transferring the data to an AWS region, violating the legal requirement. While options C and D involve encryption, encryption alone does not satisfy the requirement of keeping the data within Country X. Option A’s anonymization also doesn’t address the legal requirement; the act of anonymization itself may be a violation. Option B directly addresses the problem by utilizing Direct Connect to access the data without moving it.
A customer has an Amazon S3 bucket. Objects are uploaded simultaneously by a cluster of servers from multiple streams of data. The customer maintains a catalog of objects uploaded in Amazon S3 using an Amazon DynamoDB table. This catalog has the following fields: StreamName, TimeStamp, and ServerName, from which ObjectName can be obtained. The customer needs to define the catalog to support querying for a given stream or server within a defined time range. Which DynamoDB table scheme is most efficient to support these queries?
A. Define a Primary Key with ServerName as Partition Key and TimeStamp as Sort Key. Do NOT define a Local Secondary Index or Global Secondary Index.
B. Define a Primary Key with StreamName as Partition Key and TimeStamp followed by ServerName as Sort Key. Define a Global Secondary Index with ServerName as Partition Key and TimeStamp followed by StreamName as Sort Key.
C. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with StreamName as Partition Key. Define a Global Secondary Index with TimeStamp as Partition Key.
D. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with TimeStamp as Partition Key. Define a Global Secondary Index with StreamName as Partition Key and TimeStamp as Sort Key.
B
The most efficient scheme is B. A primary key with StreamName as the partition key and a composite sort key of TimeStamp followed by ServerName supports querying a given stream within a time range, while the global secondary index with ServerName as the partition key and TimeStamp followed by StreamName as the sort key supports querying a given server within a time range. Option A supports only queries by ServerName; retrieving the objects for a given stream would require a full table scan. Options C and D define local secondary indexes with a partition key that differs from the table's, which DynamoDB does not allow, and their base tables lack a sort key on TimeStamp, so time-range queries cannot be expressed efficiently.
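A sketch of the two query patterns that scheme B supports, using boto3 with assumed attribute and index names and a composite sort key format such as "2017-03-01T00:00:00#server-07":

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("s3-object-catalog")     # placeholder table name

# Query 1: objects for a given stream within a time range (base table).
by_stream = table.query(
    KeyConditionExpression=Key("StreamName").eq("orders-stream")
    & Key("TimeStampServerName").between("2017-03-01T00:00:00", "2017-03-01T23:59:59~"),
)

# Query 2: objects for a given server within a time range (global secondary index).
by_server = table.query(
    IndexName="ServerName-TimeStampStreamName-index",             # placeholder GSI name
    KeyConditionExpression=Key("ServerName").eq("server-07")
    & Key("TimeStampStreamName").between("2017-03-01T00:00:00", "2017-03-01T23:59:59~"),
)
```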
A company has several teams of analysts. Each team of analysts has their own cluster. The teams need to run SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company needs to enable a centralized metadata layer to expose the Amazon S3 objects as tables to the analysts. Which approach meets the requirement for a centralized metadata layer?
A. EMRFS consistent view with a common Amazon DynamoDB table
B. Bootstrap action to change the Hive Metastore to an Amazon RDS database
C. s3distcp with the outputManifest option to generate RDS DDL
D. Naming scheme support with automatic partition discovery from Amazon S3
B
The correct answer is B because it directly addresses the need for a centralized metadata layer. A centralized Hive metastore in Amazon RDS allows all analysts, regardless of their cluster, to access the same metadata, effectively exposing S3 objects as tables.
Option A is incorrect because EMRFS consistent view focuses on data consistency, not on providing a centralized metadata layer for querying S3 objects as tables. Option C is incorrect because s3distcp is a data transfer tool, not a metadata management solution. Option D relies on naming conventions and automatic partition discovery, which is insufficient for a truly centralized and robust metadata layer; it is less reliable and scalable than a dedicated metastore.
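A sketch of pointing a team's cluster at the shared metastore at launch, using the hive-site configuration classification that EMR release-based clusters support (on older AMI-based releases this was done with a bootstrap action, as the option wording suggests); the RDS endpoint, credentials, and cluster settings are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

shared_metastore = [{
    "Classification": "hive-site",
    "Properties": {
        # Shared Hive metastore on Amazon RDS (MySQL); placeholder endpoint and credentials.
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://hive-metastore.xxxxxx.us-east-1.rds.amazonaws.com:3306/hive"
            "?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "example-password",
    },
}]

emr.run_job_flow(
    Name="analyst-team-a",
    ReleaseLabel="emr-5.36.0",                        # placeholder release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}, {"Name": "Presto"}],
    Configurations=shared_metastore,                  # every team's cluster reuses this block
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```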