Practice Questions - Amazon AWS Certified Big Data - Specialty Flashcards
(85 cards)
A company’s social media manager requests more staff on the weekends to handle an increase in customer contacts from a particular region. The company needs a report to visualize the trends on weekends over the past 6 months using QuickSight. How should the data be represented?
A. A line graph plotting customer contacts vs. time, with a line for each region
B. A pie chart per region plotting customer contacts per day of the week
C. A map of regions with a heatmap overlay to show the volume of customer contacts
D. A bar graph plotting region vs. volume of social media contacts
A
A web-hosting company is building a web analytics tool to capture clickstream data from all of the websites hosted within its platform and to provide near-real-time business intelligence. This entire system is built on AWS services. The web-hosting company is interested in using Amazon Kinesis to collect this data and perform sliding window analytics. What is the most reliable and fault-tolerant technique to get each website to send data to Amazon Kinesis with every click?
A. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the sessionID as a partition key and set up a loop to retry until a success response is received.
B. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis Producer Library .addRecords method.
C. Each web server buffers the requests until the count reaches 500 and sends them to Amazon Kinesis using the Amazon Kinesis PutRecord API.
D. After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis PutRecord API. Use the exponential back-off algorithm for retries until a successful response is received.
B
The most reliable and fault-tolerant technique is to use the Amazon Kinesis Producer Library (KPL). Option B correctly identifies this. The KPL provides features like batching, retry mechanisms, rate limiting, and aggregation, all crucial for handling the high volume and potential errors inherent in clickstream data. Option A uses PutRecord, which lacks the built-in features of KPL for efficient and reliable delivery. Option C introduces unnecessary buffering that can delay near-real-time analytics. While option D improves on A with exponential backoff, it still lacks the advanced features of KPL. Therefore, B is the superior choice for reliability and fault tolerance in a near real-time system.
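For contrast with the KPL, here is a minimal sketch of option D's approach: a single PutRecord call wrapped in exponential backoff with jitter. The stream name, region, and event shape are assumptions; the KPL (a Java library) layers batching, aggregation, and this kind of retry logic on top of the same API automatically.

```python
import json
import random
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

def send_click(event: dict, session_id: str, max_attempts: int = 5) -> dict:
    """Publish one clickstream event, retrying with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="clickstream",           # placeholder stream name
                Data=json.dumps(event).encode(),
                PartitionKey=session_id,            # session ID spreads events across shards
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Sleep 2^attempt * 100 ms plus jitter before retrying.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("record not accepted after retries")
```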
A company hosts a portfolio of e-commerce websites across the Oregon, N. Virginia, Ireland, and Sydney AWS regions. Each site keeps log files that capture user behavior. The company has built an application that generates batches of product recommendations with collaborative filtering in Oregon. Oregon was selected because the flagship site is hosted there and provides the largest collection of data to train machine learning models against. The other regions DO NOT have enough historic data to train accurate machine learning models. Which set of data processing steps improves recommendations for each region?
A. Use the e-commerce application in Oregon to write replica log files in each other region.
B. Use Amazon S3 bucket replication to consolidate log entries and build a single model in Oregon.
C. Use Kinesis as a buffer for web logs and replicate logs to the Kinesis stream of a neighboring region.
D. Use the CloudWatch Logs agent to consolidate logs into a single CloudWatch Logs group.
A
The best solution is A. The problem states that other regions lack sufficient data to train their own models. Replicating the Oregon log files to other regions (A) provides them with the data necessary to build and improve their regional recommendation models.
Option B is incorrect because consolidating all logs into a single model in Oregon doesn’t improve regional recommendations; it creates a single, potentially less accurate model for all regions. Option C is not the most efficient approach because it only replicates logs to neighboring regions, not to all regions. Option D is incorrect because cross-region consolidation into a single CloudWatch Logs group was not available at the time the question was written.
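If the e-commerce application cannot easily be modified to write to multiple regions directly, the same effect can be approximated outside the application; a minimal boto3 sketch under assumed bucket names and regions:

```python
import boto3

SOURCE_BUCKET = "ecommerce-logs-us-west-2"            # placeholder bucket names
REPLICA_BUCKETS = {
    "us-east-1": "ecommerce-logs-us-east-1",
    "eu-west-1": "ecommerce-logs-eu-west-1",
    "ap-southeast-2": "ecommerce-logs-ap-southeast-2",
}

def replicate_log(key: str) -> None:
    """Copy one Oregon log object into the log bucket of every other region."""
    for region, bucket in REPLICA_BUCKETS.items():
        s3 = boto3.client("s3", region_name=region)
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
        )
```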
An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
A. When the tables are highly denormalized and do NOT participate in frequent joins.
B. When data must be grouped based on a specific key on a defined slice.
C. When data transfer between nodes must be eliminated.
D. When a new table has been loaded and it is unclear how it will be joined to dimension tables.
A, D
The correct answers are A and D. EVEN distribution in Redshift distributes rows across slices in a round-robin fashion, regardless of column values. This is ideal in two scenarios:
A. When tables are highly denormalized and do not participate in frequent joins: Since data doesn’t need to be grouped based on a specific key for efficient joins, EVEN distribution avoids unnecessary data movement.
D. When a new table is loaded and its join behavior is unclear: Using EVEN distribution provides a neutral starting point until the table’s join patterns are better understood and a more optimized distribution style (KEY or ALL) can be chosen.
Option B is incorrect because it describes KEY distribution, where data is grouped based on a specific key. Option C is incorrect because eliminating data transfer between nodes is the goal of KEY distribution, not EVEN distribution.
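Distribution style is declared in the table DDL. A brief sketch showing how EVEN, KEY, and ALL are expressed; the connection details, table names, and columns are placeholders, and psycopg2 is used here because Redshift speaks the PostgreSQL wire protocol.

```python
import psycopg2  # Amazon Redshift is reachable over the PostgreSQL wire protocol

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="example",
)

DDL_STATEMENTS = [
    # Newly loaded table with unknown join patterns: start with EVEN (round-robin).
    """CREATE TABLE staging_events (
           event_id BIGINT,
           payload  VARCHAR(4096)
       ) DISTSTYLE EVEN""",
    # Fact table joined on customer_id: KEY distribution co-locates matching rows.
    """CREATE TABLE sales (
           customer_id BIGINT,
           amount      DECIMAL(12,2)
       ) DISTSTYLE KEY DISTKEY (customer_id)""",
    # Small, slowly changing dimension table: ALL keeps a full copy on every node.
    """CREATE TABLE dim_region (
           region_id   INT,
           region_name VARCHAR(64)
       ) DISTSTYLE ALL""",
]

with conn, conn.cursor() as cur:
    for ddl in DDL_STATEMENTS:
        cur.execute(ddl)
```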
An organization needs a data store to handle the following data types and access patterns:
✑ Faceting
✑ Search
✑ Flexible schema (JSON) and fixed schema
✑ Noise word elimination
Which data store should the organization choose?
A. Amazon Relational Database Service (RDS)
B. Amazon Redshift
C. Amazon DynamoDB
D. Amazon Elasticsearch Service
D. Amazon Elasticsearch Service
The correct answer is D because Amazon Elasticsearch Service (Amazon ES) is a fully managed service that offers all the features required by the organization. It supports faceting and search natively. It can handle both flexible (JSON) and fixed schemas. Finally, noise word elimination is a common feature in search engines like Amazon ES.
Option A (Amazon RDS) is primarily a relational database, not ideal for flexible schemas or faceting/search functionalities. Option B (Amazon Redshift) is a data warehouse optimized for analytical queries, not real-time search and faceting. Option C (Amazon DynamoDB) is a NoSQL key-value and document database; while it can handle flexible schemas, it is not designed for complex search and faceting operations.
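A brief sketch of how those requirements map onto an Amazon ES domain using the elasticsearch Python client; the endpoint, index name, fields, and analyzer configuration are assumptions, and authentication/request signing is omitted:

```python
from elasticsearch import Elasticsearch

# Placeholder Amazon ES domain endpoint.
es = Elasticsearch("https://search-mydomain.us-west-2.es.amazonaws.com")

# Fixed schema for two fields plus a stop-word filter for noise word elimination;
# any additional JSON fields are accepted dynamically (flexible schema).
es.indices.create(
    index="catalog",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "no_noise": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],  # "stop" drops noise words
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "no_noise"},
                "category": {"type": "keyword"},
            }
        },
    },
)

# Full-text search combined with a terms aggregation, i.e. facet counts per category.
results = es.search(
    index="catalog",
    body={
        "query": {"match": {"title": "wireless headphones"}},
        "aggs": {"by_category": {"terms": {"field": "category"}}},
    },
)
```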
An Amazon Redshift Database is encrypted using KMS. A data engineer needs to use the AWS CLI to create a KMS encrypted snapshot of the database in another AWS region. Which three steps should the data engineer take to accomplish this task? (Choose three.)
A. Create a new KMS key in the destination region.
B. Copy the existing KMS key to the destination region.
C. Use CreateSnapshotCopyGrant to allow Amazon Redshift to use the KMS key from the source region.
D. In the source region, enable cross-region replication and specify the name of the copy grant created.
E. In the destination region, enable cross-region replication and specify the name of the copy grant created.
A, C, D
The correct answers are A, C, and D. KMS keys cannot be copied between AWS regions, so a new KMS key must be created in the destination region (A). A snapshot copy grant is then created so that Amazon Redshift can use that key to encrypt the copied snapshots (C). Finally, cross-region snapshot copy is enabled on the cluster in the source region, specifying the name of the copy grant (D). Option B is incorrect because an existing KMS key cannot be copied to another region. Option E is incorrect because cross-region copying is configured in the source region, not the destination.
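A sketch of the three steps using boto3 (the question mentions the AWS CLI, which exposes the same operations); the region names, grant name, and cluster identifier are placeholders:

```python
import boto3

SOURCE_REGION = "us-west-2"   # region hosting the encrypted cluster (placeholder)
DEST_REGION = "us-east-1"     # region that will receive the snapshot copies (placeholder)

# Step A: create a KMS key in the destination region (keys cannot be copied across regions).
kms = boto3.client("kms", region_name=DEST_REGION)
key_id = kms.create_key(Description="Redshift snapshot copy key")["KeyMetadata"]["KeyId"]

# Step C: create a snapshot copy grant in the destination region so Amazon Redshift
# can use that key to encrypt the copied snapshots.
boto3.client("redshift", region_name=DEST_REGION).create_snapshot_copy_grant(
    SnapshotCopyGrantName="my-copy-grant",            # placeholder grant name
    KmsKeyId=key_id,
)

# Step D: in the source region, enable cross-region snapshot copy and reference the grant.
boto3.client("redshift", region_name=SOURCE_REGION).enable_snapshot_copy(
    ClusterIdentifier="my-encrypted-cluster",         # placeholder cluster identifier
    DestinationRegion=DEST_REGION,
    RetentionPeriod=7,
    SnapshotCopyGrantName="my-copy-grant",
)
```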
An Amazon EMR cluster using EMRFS has access to petabytes of data on Amazon S3, originating from multiple unique data sources. The customer needs to query common fields across some of the data sets to be able to perform interactive joins and then display results quickly. Which technology is most appropriate to enable this capability?
A. Presto
B. MicroStrategy
C. Pig
D. R Studio
A. Presto
Presto is the most appropriate technology because it is designed for interactive querying and fast joins on large datasets. The question specifically mentions the need for “interactive joins and then display results quickly,” which aligns perfectly with Presto’s capabilities. Pig is better suited for batch processing, MicroStrategy is a business intelligence tool, and R Studio is primarily for statistical computing and data analysis, making them less suitable for this specific use case requiring fast interactive querying of petabytes of data.
A large oil and gas company needs to provide near real-time alerts when peak thresholds are exceeded in its pipeline system. The company has developed a system to capture pipeline metrics such as flow rate, pressure, and temperature using millions of sensors. The sensors deliver their data to AWS IoT. What is a cost-effective way to provide near real-time alerts on the pipeline metrics?
A. Create an AWS IoT rule to generate an Amazon SNS notification.
B. Store the data points in an Amazon DynamoDB table and poll it for peak metrics data from an Amazon EC2 application.
C. Create an Amazon Machine Learning model and invoke it with AWS Lambda.
D. Use Amazon Kinesis Streams and a KCL-based application deployed on AWS Elastic Beanstalk.
A
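An AWS IoT rule evaluates each incoming message with a SQL statement and can invoke an Amazon SNS action directly, with no servers to run or poll, which is what makes option A the cost-effective choice. A minimal sketch with boto3; the topic filter, threshold, ARNs, and rule name are placeholders:

```python
import boto3

iot = boto3.client("iot", region_name="us-west-2")    # assumed region

iot.create_topic_rule(
    ruleName="pipeline_pressure_alert",               # placeholder rule name
    topicRulePayload={
        # Fire only when a reported pressure reading exceeds the peak threshold.
        "sql": "SELECT * FROM 'pipeline/+/metrics' WHERE pressure > 300",
        "actions": [
            {
                "sns": {
                    "targetArn": "arn:aws:sns:us-west-2:123456789012:pipeline-alerts",
                    "roleArn": "arn:aws:iam::123456789012:role/iot-sns-publish",
                    "messageFormat": "JSON",
                }
            }
        ],
    },
)
```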
A systems engineer for a company proposes the digitization and backup of large customer archives. The systems engineer needs to provide users with secure storage that ensures data can never be tampered with once it has been uploaded. How should this be accomplished?
A. Create an Amazon Glacier Vault. Specify a “Deny” Vault Lock policy on this Vault to block “glacier:DeleteArchive”.
B. Create an Amazon S3 bucket. Specify a “Deny” bucket policy on this bucket to block “s3:DeleteObject”.
C. Create an Amazon Glacier Vault. Specify a “Deny” vault access policy on this Vault to block “glacier:DeleteArchive”.
D. Create secondary AWS Account containing an Amazon S3 bucket. Grant “s3:PutObject” to the primary account.
A
The correct answer is A because a Vault Lock policy, unlike a Vault Access policy, cannot be modified after it’s locked. This ensures that the “glacier:DeleteArchive” action is permanently blocked, preventing any tampering with the data after upload. Option B is incorrect because S3 bucket policies, even with a “Deny” setting, can be changed. Option C is incorrect because a vault access policy can be modified, allowing for potential future tampering. Option D is incorrect as it doesn’t inherently prevent data tampering after upload; it only controls who can upload data.
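A sketch of initiating and completing a Vault Lock policy that denies glacier:DeleteArchive, using boto3; the vault name and account ID are placeholders. After complete_vault_lock, the policy can no longer be changed or removed.

```python
import json

import boto3

glacier = boto3.client("glacier", region_name="us-west-2")   # assumed region

lock_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "deny-archive-deletion",
        "Principal": "*",
        "Effect": "Deny",
        "Action": "glacier:DeleteArchive",
        "Resource": "arn:aws:glacier:us-west-2:123456789012:vaults/customer-archives",
    }],
}

# Initiating the lock returns a lock ID and starts a 24-hour window in which the
# policy can still be tested or aborted.
resp = glacier.initiate_vault_lock(
    accountId="-",                                    # "-" means the calling account
    vaultName="customer-archives",                    # placeholder vault name
    policy={"Policy": json.dumps(lock_policy)},
)

# Completing the lock makes the policy immutable.
glacier.complete_vault_lock(
    accountId="-",
    vaultName="customer-archives",
    lockId=resp["lockId"],
)
```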
A clinical trial will rely on medical sensors to remotely assess patient health. Each participating physician requires visual reports each morning. These reports are built from aggregations of all sensor data collected each minute. What is the most cost-effective solution for creating this daily visualization?
A. Use Kinesis Aggregators Library to generate reports for reviewing the patient sensor data and generate a QuickSight visualization on the new data each morning for the physician to review.
B. Use a transient EMR cluster that shuts down after use to aggregate the patient sensor data each night and generate a QuickSight visualization on the new data each morning for the physician to review.
C. Use Spark Streaming on EMR to aggregate the patient sensor data every 15 minutes and generate a QuickSight visualization on the new data each morning for the physician to review.
D. Use an EMR cluster to aggregate the patient sensor data each night and provide Zeppelin notebooks that look at the new data residing on the cluster each morning for the physician to review.
B
The most cost-effective solution is B. Using a transient EMR cluster ensures that resources are only consumed during the nightly aggregation process. The cluster shuts down afterward, minimizing costs compared to continuously running clusters (A, C, and D). QuickSight is a cost-effective visualization tool suitable for generating daily reports for multiple physicians. Option A might be less cost effective if the Kinesis data volume is large. Options C and D involve continuously running resources (Spark streaming or a persistent EMR cluster), making them more expensive than a transient cluster. Option D also uses Zeppelin notebooks, which are less efficient for generating visualizations compared to QuickSight for multiple users.
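A sketch of launching the nightly transient cluster with boto3: setting KeepJobFlowAliveWhenNoSteps to False makes the cluster terminate itself once its last step finishes. The release label, instance types, script location, and role names are placeholders.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

emr.run_job_flow(
    Name="nightly-sensor-aggregation",
    ReleaseLabel="emr-5.36.0",                        # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Transient behavior: shut the cluster down when the last step completes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-sensor-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate_sensors.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```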
The department of transportation for a major metropolitan area has placed sensors on roads at key locations around the city. The goal is to analyze the flow of traffic and notifications from emergency services to identify potential issues and to help planners respond to issues within 30 seconds of their occurrence. Which solution should a data engineer choose to create a scalable and fault-tolerant solution that meets this requirement?
A. Collect the sensor data with Amazon Kinesis Firehose and store it in Amazon Redshift for analysis. Collect emergency services events with Amazon SQS and store in Amazon DynamoDB for analysis.
B. Collect the sensor data with Amazon SQS and store in Amazon DynamoDB for analysis. Collect emergency services events with Amazon Kinesis Firehose and store in Amazon Redshift for analysis.
C. Collect both sensor data and emergency services events with Amazon Kinesis Streams and use DynamoDB for analysis.
D. Collect both sensor data and emergency services events with Amazon Kinesis Firehose and use Amazon Redshift for analysis.
A
An administrator needs to design a distribution strategy for a star schema in a Redshift cluster. The administrator needs to determine the optimal distribution style for the tables in the Redshift schema. In which three circumstances would choosing Key-based distribution be most appropriate? (Select three.)
A. When the administrator needs to optimize a large, slowly changing dimension table.
B. When the administrator needs to reduce cross-node traffic.
C. When the administrator needs to optimize the fact table for parity with the number of slices.
D. When the administrator needs to balance data distribution and collocation of data.
E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates.
B, D, E
B, D, and E are the circumstances in which Key-based distribution is most appropriate:
- B. When the administrator needs to reduce cross-node traffic: Key-based distribution minimizes data movement across nodes during joins by ensuring related data resides on the same node. This is a primary benefit of this distribution style.
- D. When the administrator needs to balance data distribution and collocation of data: Key-based distribution helps balance data distribution while ensuring data needed for joins is co-located.
- E. When the administrator needs to take advantage of data locality on a local node for joins and aggregates: This directly relates to the reduced cross-node traffic mentioned in B, making this a key advantage of key-based distribution.
Option A is incorrect because large, slowly changing dimension tables are often better suited to ALL distribution. Option C is incorrect because parity with the number of slices is generally associated with even distribution, not key-based distribution.
A company receives data sets coming from external providers on Amazon S3. Data sets from different providers are dependent on one another. Data sets will arrive at different times and in no particular order. A data architect needs to design a solution that enables the company to do the following:
✑ Rapidly perform cross data set analysis as soon as the data becomes available
✑ Manage dependencies between data sets that arrive at different times
Which architecture strategy offers a scalable and cost-effective solution that meets these requirements?
A. Maintain data dependency information in Amazon RDS for MySQL. Use an AWS Data Pipeline job to load an Amazon EMR Hive table based on task dependencies and event notification triggers in Amazon S3.
B. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon SNS and event notifications to publish data to a fleet of Amazon EC2 workers. Once the task dependencies have been resolved, process the data with Amazon EMR.
C. Maintain data dependency information in an Amazon ElastiCache Redis cluster. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to Redis. Once the task dependencies have been resolved, process the data with Amazon EMR.
D. Maintain data dependency information in an Amazon DynamoDB table. Use Amazon S3 event notifications to trigger an AWS Lambda function that maps the S3 object to the task associated with it in DynamoDB. Once all task dependencies have been resolved, process the data with Amazon EMR.
D
Option D is the most scalable and cost-effective choice. DynamoDB is well suited for storing the configuration data that represents task dependencies, offering better scalability than Redis (option C) and RDS (option A). Option B also uses DynamoDB, but relies on SNS and a fleet of Amazon EC2 workers, which is less efficient and potentially more costly than triggering AWS Lambda functions from S3 event notifications as in option D.
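A sketch of the Lambda handler at the heart of option D, under assumed table and attribute names: each S3 event notification marks the corresponding object as arrived in DynamoDB, and once no dependencies remain pending the handler submits the EMR step.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
emr = boto3.client("emr")

# Assumed table: partition key "task_id", a string set "pending" of object keys still
# outstanding, and "cluster_id" naming the EMR cluster that runs the analysis.
tasks = dynamodb.Table("dataset-dependencies")

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        task_id = key.split("/")[0]        # assumed: the provider prefix identifies the task

        # Atomically remove this object from the task's pending set.
        item = tasks.update_item(
            Key={"task_id": task_id},
            UpdateExpression="DELETE pending :k",
            ExpressionAttributeValues={":k": {key}},
            ReturnValues="ALL_NEW",
        )["Attributes"]

        # DynamoDB drops the attribute once the set is empty, so a missing "pending"
        # means every dependency has arrived and the analysis can start.
        if not item.get("pending"):
            emr.add_job_flow_steps(
                JobFlowId=item["cluster_id"],
                Steps=[{
                    "Name": f"analyze-{task_id}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit",
                                 "s3://my-bucket/jobs/cross_dataset.py", task_id],
                    },
                }],
            )
```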
A game company uses DynamoDB to support its game application. Amazon Redshift stores the past two years of historical data. Game traffic fluctuates throughout the year due to factors like seasons, movie releases, and holidays. An administrator needs to predict the required DynamoDB read and write throughput (RCU and WCU) for each week in advance. Which approach should the administrator use?
A. Feed the data into Amazon Machine Learning and build a regression model.
B. Feed the data into Spark MLlib and build a random forest model.
C. Feed the data into Apache Mahout and build a multi-classification model.
D. Feed the data into Amazon Machine Learning and build a binary classification model.
A
The best approach is A because Redshift contains two years of historical data, which can be used as labeled data to train a regression model. A regression model is suitable for predicting a continuous value, such as the required RCU and WCU. Options B and C are less suitable because they are focused on classification problems, not regression. Option D is also less suitable because a binary classification model would only predict whether the throughput is above or below a certain threshold, not the precise amount needed.
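The (now legacy) Amazon Machine Learning API makes the distinction explicit: a REGRESSION model predicts a numeric target such as the coming week's RCU or WCU. A minimal sketch with boto3, with the model ID and the training datasource (built from the Redshift history) assumed to exist:

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")   # assumed region

ml.create_ml_model(
    MLModelId="ddb-throughput-model",                 # placeholder identifiers
    MLModelName="Weekly RCU/WCU forecast",
    MLModelType="REGRESSION",      # numeric target, unlike BINARY or MULTICLASS models
    TrainingDataSourceId="redshift-throughput-training-ds",
)
```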
A customer is collecting clickstream data using Amazon Kinesis and is grouping the events by IP address into 5-minute chunks stored in Amazon S3. Many analysts in the company use Hive on Amazon EMR to analyze this data. Their queries always reference a single IP address. Data must be optimized for querying based on IP address using Hive running on Amazon EMR. What is the most efficient method to query the data with Hive?
A. Store an index of the files by IP address in the Amazon DynamoDB metadata store for EMRFS.
B. Store the Amazon S3 objects with the following naming scheme: bucket_name/source=ip_address/year=yy/month=mm/day=dd/hour=hh/filename.
C. Store the data in an HBase table with the IP address as the row key.
D. Store the events for an IP address as a single file in Amazon S3 and add metadata with keys: Hive_Partitioned_IPAddress.
B
The most efficient method is to leverage Hive’s partitioning capabilities in S3. Option B uses a naming scheme that creates partitions based on the IP address and date/time. This allows Hive to quickly locate the relevant data based on the IP address in the query, avoiding a full table scan.
Option A is incorrect because while DynamoDB can store metadata, it’s not directly integrated with Hive’s partitioning functionality in the same way that S3 is. Option C is incorrect because HBase is a NoSQL database, not optimized for Hive queries. Option D is incorrect because while metadata is helpful, it doesn’t improve query efficiency as much as partitioning the data in S3 directly via the file naming convention. Combining all events for a single IP address into one file doesn’t improve performance because Hive would still need to scan that single (potentially very large) file.
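A sketch of the writer side of option B under assumed bucket and field names: each 5-minute chunk is stored under a Hive-style key that encodes the IP address and timestamp as partition columns, so a query filtering on source = '10.1.2.3' prunes everything else. On the Hive side, the table would be declared with PARTITIONED BY (source STRING, year STRING, month STRING, day STRING, hour STRING) and new partitions registered with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def store_chunk(ip_address: str, events: bytes) -> str:
    """Write a 5-minute chunk of events under a Hive-partitioned S3 key."""
    now = datetime.now(timezone.utc)
    key = (
        f"source={ip_address}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/hour={now:%H}/"
        f"events-{now:%M}.json"
    )
    s3.put_object(Bucket="clickstream-archive", Key=key, Body=events)  # placeholder bucket
    return key
```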
A system needs to collect on-premises application spool files into a persistent storage layer in AWS. Each spool file is 2 KB. The application generates 1 million files per hour. Each source file is automatically deleted from the local server after an hour. What is the most cost-efficient option to meet these requirements?
A. Write file contents to an Amazon DynamoDB table.
B. Copy files to Amazon S3 Standard Storage.
C. Write file contents to Amazon ElastiCache.
D. Copy files to Amazon S3 Infrequent Access storage.
A
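The deciding factor is the per-request cost of landing a million 2 KB files every hour. A rough comparison of ingest costs only, with illustrative prices that should be checked against current pricing, shows why writing the contents to DynamoDB (option A) comes out well ahead of S3 PUT requests (options B and D):

```python
# Back-of-the-envelope ingest cost comparison (illustrative prices, not authoritative).
FILES_PER_HOUR = 1_000_000
HOURS_PER_MONTH = 730

S3_PUT_PER_1000 = 0.005       # assumed USD per 1,000 S3 PUT requests
DDB_WCU_PER_HOUR = 0.00065    # assumed USD per provisioned write capacity unit per hour

# S3: every 2 KB file is a separate PUT request.
s3_monthly = FILES_PER_HOUR * HOURS_PER_MONTH / 1000 * S3_PUT_PER_1000

# DynamoDB: a 2 KB item consumes 2 WCUs; ~278 writes per second sustained.
wcu_needed = FILES_PER_HOUR / 3600 * 2
ddb_monthly = wcu_needed * DDB_WCU_PER_HOUR * HOURS_PER_MONTH

print(f"S3 PUT requests:       ~${s3_monthly:,.0f}/month")   # roughly $3,650
print(f"DynamoDB write units:  ~${ddb_monthly:,.0f}/month")  # roughly $260
```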
A company that manufactures and sells smart air conditioning units also offers add-on services so that customers can see real-time dashboards in a mobile application or a web browser. Each unit sends its sensor information in JSON format every two seconds for processing and analysis. The company also needs to consume this data to predict possible equipment problems before they occur. A few thousand pre-purchased units will be delivered in the next couple of months. The company expects high market growth in the next year and needs to handle a massive amount of data and scale without interruption. Which ingestion solution should the company use?
A. Write sensor data records to Amazon Kinesis Streams. Process the data using KCL applications for the end-consumer dashboard and anomaly detection workflows.
B. Batch sensor data to Amazon Simple Storage Service (S3) every 15 minutes. Flow the data downstream to the end-consumer dashboard and to the anomaly detection application.
C. Write sensor data records to Amazon Kinesis Firehose with Amazon Simple Storage Service (S3) as the destination. Consume the data with a KCL application for the end-consumer dashboard and anomaly detection.
D. Write sensor data records to Amazon Relational Database Service (RDS). Build both the end-consumer dashboard and anomaly detection application on top of Amazon RDS.
A
Option A, using Kinesis Streams with KCL applications, is the best fit for real-time processing and massive scaling, although resharding a stream does require some operational care. Options B, C, and D are clearly inferior: B is unsuitable for real-time dashboards because of the 15-minute batching; C incorrectly assumes a KCL application can consume from Kinesis Firehose; and D is unsuitable because RDS is not designed for the volume and velocity of data involved and does not scale efficiently for this use case. Because the question emphasizes real-time processing and scaling without interruption, A is the closest fit despite the resharding consideration.
A media advertising company handles a large number of real-time messages sourced from over 200 websites. The company’s data engineer needs to collect and process records in real time for analysis using Spark Streaming on Amazon Elastic MapReduce (EMR). The data engineer needs to fulfill a corporate mandate to keep ALL raw messages as they are received as a top priority. Which Amazon Kinesis configuration meets these requirements?
A. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Pull messages off Firehose with Spark Streaming in parallel to persistence to Amazon S3.
B. Publish messages to Amazon Kinesis Streams. Pull messages off Streams with Spark Streaming in parallel to AWS Lambda pushing messages from Streams to Firehose backed by Amazon Simple Storage Service (S3).
C. Publish messages to Amazon Kinesis Firehose backed by Amazon Simple Storage Service (S3). Use AWS Lambda to pull messages from Firehose to Streams for processing with Spark Streaming.
D. Publish messages to Amazon Kinesis Streams, pull messages off with Spark Streaming, and write raw data to Amazon Simple Storage Service (S3) before and after processing.
D
Option D is the most suitable because it allows Spark Streaming to read directly from Kinesis Streams, fulfilling the real-time processing requirement. The raw data is written to S3 both before and after processing, ensuring that all raw messages are preserved as mandated.
Option A is incorrect because Spark Streaming cannot directly read from Kinesis Firehose. Option B is inefficient and adds unnecessary complexity by using Lambda as an intermediary. Option C is also incorrect because it introduces a delay by routing data through Firehose and Lambda before reaching Spark Streaming, violating the real-time requirement and potentially leading to data loss if the process isn’t perfectly reliable.
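A sketch of option D's consumer under Spark 2.x with the spark-streaming-kinesis-asl package available on EMR; the stream name, region, and bucket are placeholders, and the exact API differs between Spark versions. Each micro-batch is persisted to S3 as raw data before any transformation, and processed output would be written to S3 as well.

```python
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import InitialPositionInStream, KinesisUtils

sc = SparkContext(appName="ad-message-processing")
ssc = StreamingContext(sc, 10)                        # 10-second micro-batches

raw = KinesisUtils.createStream(
    ssc,
    "ad-message-processing",                          # KCL application / checkpoint table name
    "ad-messages",                                    # placeholder stream name
    "https://kinesis.us-east-1.amazonaws.com",        # endpoint URL
    "us-east-1",                                      # region
    InitialPositionInStream.LATEST,
    10,                                               # checkpoint interval (seconds)
    StorageLevel.MEMORY_AND_DISK_2,
)

# Corporate mandate: keep ALL raw messages, so persist each batch before processing.
raw.saveAsTextFiles("s3://ad-raw-messages/incoming/batch")

# ... real-time transformations/aggregations would follow here, with results written
# back to S3 (the "after processing" copy).

ssc.start()
ssc.awaitTermination()
```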
A new algorithm has been written in Python to identify SPAM e-mails. The algorithm analyzes the free text contained within a sample set of 1 million e-mails stored on Amazon S3. The algorithm must be scaled across a production dataset of 5 PB, which also resides in Amazon S3 storage. Which AWS service strategy is best for this use case?
A. Copy the data into Amazon ElastiCache to perform text analysis on the in-memory data and export the results of the model into Amazon Machine Learning.
B. Use Amazon EMR to parallelize the text analysis tasks across the cluster using a streaming program step.
C. Use Amazon Elasticsearch Service to store the text and then use the Python Elasticsearch Client to run analysis against the text index.
D. Initiate a Python job from AWS Data Pipeline to run directly against the Amazon S3 text files.
B
The best option is B because Amazon EMR (Elastic MapReduce) is designed for processing massive datasets like the 5 PB of email data described. EMR allows for parallel processing using Hadoop or Spark, which are ideal for distributing the computational load across a cluster. This efficiently handles the scale of the data and the computational demands of the text analysis algorithm.
Option A is incorrect because ElastiCache is an in-memory data store, not designed for processing 5 PB of data. Option C is incorrect because Elasticsearch, while useful for indexing and searching, is not ideal for processing this scale of data; it has storage limitations. Option D is incorrect because while Data Pipeline can manage jobs, it doesn’t inherently provide the parallel processing capabilities necessary for efficiently handling such a large dataset. Directly processing the data in S3 using a single Python job would be extremely slow and inefficient.
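A sketch of running the Python algorithm as a Hadoop streaming step on an existing EMR cluster via boto3; the cluster ID, script locations, and bucket names are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                      # placeholder cluster ID
    Steps=[{
        "Name": "spam-classification",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/code/spam_mapper.py,s3://my-bucket/code/spam_reducer.py",
                "-mapper", "spam_mapper.py",
                "-reducer", "spam_reducer.py",
                "-input", "s3://my-bucket/emails/",
                "-output", "s3://my-bucket/spam-results/",
            ],
        },
    }],
)
```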
An administrator needs to design the event log storage architecture for events from mobile devices. The event data will be processed by an Amazon EMR cluster daily for aggregated reporting and analytics before being archived. How should the administrator recommend storing the log data?
A. Create an Amazon S3 bucket and write log data into folders by device. Execute the EMR job on the device folders.
B. Create an Amazon DynamoDB table partitioned on the device and sorted on date, write log data to table. Execute the EMR job on the Amazon DynamoDB table.
C. Create an Amazon S3 bucket and write data into folders by day. Execute the EMR job on the daily folder.
D. Create an Amazon DynamoDB table partitioned on EventID, write log data to table. Execute the EMR job on the table.
C
The best approach is to store the log data in an Amazon S3 bucket, organized into folders by day (C). This aligns with the daily processing requirement of the EMR cluster. EMR is efficient at processing many objects from a single S3 folder in parallel, making daily aggregation straightforward. Options A, B, and D are less efficient: A would require many separate EMR jobs because processing is organized by device rather than by day; B and D use DynamoDB, which is designed for low-latency reads and writes rather than the batch processing this scenario requires, and costs more than S3 for this workload. Finer-grained partitioning is not necessary, because EMR handles parallel processing of the many objects within each daily folder.
There are thousands of text files on Amazon S3. The total size of the files is 1 PB. The files contain retail order information for the past 2 years. A data engineer needs to run multiple interactive queries to manipulate this data. The Data Engineer has AWS access to spin up an Amazon EMR cluster. The data engineer needs to use an application on the cluster to process this data and return the results in an interactive timeframe. Which application on the cluster should the data engineer use?
A. Oozie
B. Apache Pig with Tachyon
C. Apache Hive
D. Presto
D
The best answer is Presto because the question emphasizes the need for interactive query processing. Presto is designed for fast query execution and low latency, making it ideal for interactive analysis of large datasets.
While Apache Hive is suitable for large-scale batch processing, it’s not optimized for interactive queries. Oozie is a workflow scheduler, not a query engine. Apache Pig with Tachyon is better suited for ETL (Extract, Transform, Load) processes and data manipulation rather than interactive querying, although Tachyon’s in-memory capabilities could improve performance. The primary requirement here is interactive querying, which makes Presto the most appropriate choice.
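Analysts can point a SQL client at the cluster's Presto coordinator, which listens on port 8889 by default on EMR; a sketch using the PyHive client, with the host, schema, and table names as placeholders:

```python
from pyhive import presto

# Placeholder master-node DNS name; 8889 is Presto's default port on EMR.
conn = presto.connect(host="ec2-xx-xx-xx-xx.compute-1.amazonaws.com", port=8889)
cur = conn.cursor()

# Interactive aggregation over a Hive-defined table of retail orders (placeholder schema).
cur.execute("""
    SELECT order_date, SUM(total_amount) AS daily_revenue
    FROM hive.retail.orders
    WHERE order_date BETWEEN DATE '2016-01-01' AND DATE '2016-12-31'
    GROUP BY order_date
    ORDER BY order_date
""")
for row in cur.fetchall():
    print(row)
```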
Company A operates in Country X and maintains a large dataset of historical purchase orders containing personal data (full names and telephone numbers) in five 1TB text files. This data is stored on-premises due to legal requirements. The R&D department needs to run a clustering algorithm using Amazon EMR in the closest AWS region, with a minimum latency of 200ms between the on-premises system and the AWS region. Which option allows Company A to perform clustering in the AWS Cloud while complying with the legal requirement of keeping personal data within Country X?
A. Anonymize the personal data, transfer the files to Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
B. Establish a Direct Connect link between the on-premises system and the AWS region to reduce latency, and have the EMR cluster read the data directly from the on-premises storage system over Direct Connect.
C. Encrypt the data files according to Country X’s encryption standards, store them in Amazon S3 in the AWS region, and have the EMR cluster read the data using EMRFS.
D. Use AWS Import/Export Snowball to transfer the data to the AWS region, copy the files to an EBS volume, and have the EMR cluster read the data using EMRFS.
B
The correct answer is B because it is the only option that keeps the data on-premises, thereby fulfilling the legal requirement of storing personal data in Country X. Options A, C, and D all involve transferring the data to an AWS region, violating the legal requirement. While options C and D involve encryption, encryption alone does not satisfy the requirement of keeping the data within Country X. Option A’s anonymization also doesn’t address the legal requirement; the act of anonymization itself may be a violation. Option B directly addresses the problem by utilizing Direct Connect to access the data without moving it.
A customer has an Amazon S3 bucket. Objects are uploaded simultaneously by a cluster of servers from multiple streams of data. The customer maintains a catalog of objects uploaded in Amazon S3 using an Amazon DynamoDB table. This catalog has the following fields: StreamName, TimeStamp, and ServerName, from which ObjectName can be obtained. The customer needs to define the catalog to support querying for a given stream or server within a defined time range. Which DynamoDB table scheme is most efficient to support these queries?
A. Define a Primary Key with ServerName as Partition Key and TimeStamp as Sort Key. Do NOT define a Local Secondary Index or Global Secondary Index.
B. Define a Primary Key with StreamName as Partition Key and TimeStamp followed by ServerName as Sort Key. Define a Global Secondary Index with ServerName as Partition Key and TimeStamp followed by StreamName as Sort Key.
C. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with StreamName as Partition Key. Define a Global Secondary Index with TimeStamp as Partition Key.
D. Define a Primary Key with ServerName as Partition Key. Define a Local Secondary Index with TimeStamp as Partition Key. Define a Global Secondary Index with StreamName as Partition Key and TimeStamp as Sort Key.
B
The most efficient scheme is B. A primary key with StreamName as the partition key and a composite sort key of TimeStamp followed by ServerName supports querying a given stream within a time range, while the global secondary index with ServerName as the partition key and TimeStamp followed by StreamName as the sort key supports querying a given server within a time range. Option A supports only queries by ServerName; retrieving the objects for a given stream would require a full table scan. Options C and D define local secondary indexes with a partition key that differs from the table's, which DynamoDB does not allow, and their base tables lack a sort key on TimeStamp, so time-range queries cannot be expressed efficiently.
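A sketch of the two query patterns that scheme B supports, using boto3 with assumed attribute and index names and a composite sort key format such as "2017-03-01T00:00:00#server-07":

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("s3-object-catalog")     # placeholder table name

# Query 1: objects for a given stream within a time range (base table).
by_stream = table.query(
    KeyConditionExpression=Key("StreamName").eq("orders-stream")
    & Key("TimeStampServerName").between("2017-03-01T00:00:00", "2017-03-01T23:59:59~"),
)

# Query 2: objects for a given server within a time range (global secondary index).
by_server = table.query(
    IndexName="ServerName-TimeStampStreamName-index",             # placeholder GSI name
    KeyConditionExpression=Key("ServerName").eq("server-07")
    & Key("TimeStampStreamName").between("2017-03-01T00:00:00", "2017-03-01T23:59:59~"),
)
```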
A company has several teams of analysts. Each team of analysts has their own cluster. The teams need to run SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company needs to enable a centralized metadata layer to expose the Amazon S3 objects as tables to the analysts. Which approach meets the requirement for a centralized metadata layer?
A. EMRFS consistent view with a common Amazon DynamoDB table
B. Bootstrap action to change the Hive Metastore to an Amazon RDS database
C. s3distcp with the outputManifest option to generate RDS DDL
D. Naming scheme support with automatic partition discovery from Amazon S3
B
The correct answer is B because it directly addresses the need for a centralized metadata layer. A centralized Hive metastore in Amazon RDS allows all analysts, regardless of their cluster, to access the same metadata, effectively exposing S3 objects as tables.
Option A is incorrect because EMRFS consistent view focuses on data consistency, not on providing a centralized metadata layer for querying S3 objects as tables. Option C is incorrect because s3distcp is a data transfer tool, not a metadata management solution. Option D relies on naming conventions and automatic partition discovery, which is insufficient for a truly centralized and robust metadata layer; it is less reliable and scalable than a dedicated metastore.
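A sketch of pointing a team's cluster at the shared metastore at launch, using the hive-site configuration classification that EMR release-based clusters support (on older AMI-based releases this was done with a bootstrap action, as the option wording suggests); the RDS endpoint, credentials, and cluster settings are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")    # assumed region

shared_metastore = [{
    "Classification": "hive-site",
    "Properties": {
        # Shared Hive metastore on Amazon RDS (MySQL); placeholder endpoint and credentials.
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://hive-metastore.xxxxxx.us-east-1.rds.amazonaws.com:3306/hive"
            "?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "example-password",
    },
}]

emr.run_job_flow(
    Name="analyst-team-a",
    ReleaseLabel="emr-5.36.0",                        # placeholder release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}, {"Name": "Presto"}],
    Configurations=shared_metastore,                  # every team's cluster reuses this block
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```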