Analysis Flashcards
From which sources can the input for Kinesis Analytics be obtained?
- MySQL and Kinesis Data Streams
- DynamoDB and Kinesis Firehose delivery streams
- Kinesis Data Streams and Kinesis Firehose delivery streams
- Kinesis Data Streams and DynamoDB
Kinesis Data Streams and Kinesis Firehose delivery streams (Kinesis Analytics can only read from Kinesis sources, but both data streams and Firehose delivery streams are supported.)
After real-time analysis has been performed on the input source, where may you send the processed data for further processing?
Kinesis Data Stream or Firehose (While you might in turn connect S3 or Redshift to your Kinesis Analytics output stream, Kinesis Analytics must have a stream as its input, and a stream or Lambda function as its output.)
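As an illustration, here is a minimal sketch of the in-application SQL a Kinesis Data Analytics (SQL) application might use to pump results into its output stream; the column names (ticker, price) and the windowed aggregation are hypothetical, while "SOURCE_SQL_STREAM_001" is the default name of the first in-application input stream.

```sql
-- In-application output stream. Kinesis Data Analytics maps this stream to the
-- configured destination: a Kinesis data stream, a Firehose delivery stream,
-- or a Lambda function.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker    VARCHAR(4),
    avg_price DOUBLE
);

-- A pump continuously reads from the in-application input stream and inserts
-- aggregated rows into the output stream.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM ticker, AVG(price)
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY ticker,
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
```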
If a record arrives late to your application during stream processing, what happens to it?
The record is written to the error stream
You have heard from your AWS consultant that Amazon Kinesis Data Analytics elastically scales the application to accommodate the data throughput. What, though, is the default capacity of the processing application in terms of memory?
32 GB (Kinesis Data Analytics provisions capacity in the form of Kinesis Processing Units (KPUs). A single KPU provides 4 GB of memory along with corresponding compute and networking. The default limit is eight KPUs per application, so 8 × 4 GB = 32 GB.)
You have configured a Kinesis Data Analytics application and have been streaming the source data to it. You have also configured the destination correctly. However, even after waiting for a while, you are not seeing any data arrive at the destination. What might be a possible cause?
- Issue with IAM role
- Mismatched name for the output stream
- Destination service is currently unavailable
- Any of the above
Any of the above
How can you ensure maximum security for your Amazon ES cluster?
- Bind with a VPC
- Use security groups
- Use IAM policies
- Use access policies associated with the Elasticsearch domain creation
- All of the above
All of the above
As recommended by AWS, you are going to ensure you have dedicated master nodes for high performance. As a user, what can you configure for the master nodes?
- The count and instance types of the master nodes
- The EBS volume associated with the node
- The upper limit of network traffic / bandwidth
- All of the above
The count and instance types of the master nodes
Which are supported ways to import data into your Amazon ES domain?
- Directly from an RDS instance
- Via Kinesis, Logstash, and the Elasticsearch APIs
- Via Kinesis, SQS, and Beats
- Via SQS, Firehose, and Logstash
Via Kinesis, Logstash, and the Elasticsearch APIs
What can you do to prevent data loss due to nodes within your ES domain failing?
Maintain snapshots of the Elasticsearch Service domain (Amazon ES creates daily snapshots to S3 by default, and you can create them more often if you wish.)
You are going to set up an Amazon ES cluster and have it configured in your VPC. You want your customers outside your VPC to visualize the logs reaching the ES domain using Kibana. How can this be achieved?
- Use a reverse proxy
- Use a VPN
- Use VPC
- Use VPC Direct Connect
- Any of the above
Any of the above
As a Big Data analyst, you need to query/analyze data from a set of CSV files stored in S3. Which of the following serverless services helps you with this?
- AWS Glacier
- AWS EMR
- AWS Athena
- AWS Redshift
AWS Athena
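A minimal sketch of how Athena can query CSV files in place; the table name, columns, and S3 location below are hypothetical placeholders.

```sql
-- Define an external table over the CSV files sitting in S3 (no data is loaded).
CREATE EXTERNAL TABLE sales_csv (
    sale_id   INT,
    amount    DOUBLE,
    sale_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/sales-csv/';

-- Query the files directly with standard SQL, paying only for the data scanned.
SELECT sale_date, SUM(amount) AS total
FROM sales_csv
GROUP BY sale_date;
```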
What are two columnar data formats supported by Athena?
Parquet and ORC
Your organization is querying JSON data stored in S3 using Athena, and wishes to reduce costs and improve performance with Athena. What steps might you take?
Convert the data from JSON to ORC format, and analyze the ORC data with Athena
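One way to do the conversion is an Athena CTAS statement, sketched below under assumed table names, columns, and bucket path.

```sql
-- Create a new ORC-formatted copy of the JSON table. Subsequent queries hit
-- the columnar data, scanning (and billing for) far fewer bytes.
CREATE TABLE logs_orc
WITH (
    format = 'ORC',
    external_location = 's3://my-bucket/logs-orc/'
) AS
SELECT event_time, user_id, status_code
FROM logs_json;
```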
When using Athena, you are charged separately for using the AWS Glue Data Catalog. True or False?
True
Which of the following statements is NOT TRUE regarding Athena pricing?
- Amazon Athena charges you for cancelled queries
- Amazon Athena charges you for failed queries
- You will get charged less when using a columnar format
- Amazon Athena is priced per query and charges based on the amount of data scanned by the query
Amazon Athena charges you for failed queries
You are working as a Big Data analyst for a data warehousing company. The company uses Redshift clusters for data analytics. For auditing and compliance purposes, you need to monitor API calls to the Redshift cluster and keep that audit data secure.
Which of the following services helps in this regard?
- CloudTrail logs
- CloudWatch logs
- Redshift Spectrum
- Amazon MQ
CloudTrail logs
You are working as a Big Data analyst for a financial enterprise that has a large dataset requiring columnar storage to reduce disk I/O. The data must also be queried quickly in order to generate reports. Which of the following services is best suited for this scenario?
- DynamoDB
- RDS
- Athena
- Redshift
Redshift
You are working for a data warehouse company that uses an Amazon Redshift cluster. It is required that VPC flow logs be used to monitor all COPY and UNLOAD traffic of the cluster that moves in and out of the VPC. Which of the following helps you in this regard?
- By using
- By enabling Enhanced VPC routing on the Amazon Redshift cluster
- By using Redshift WLM
- By enabling audit logging in the Redshift cluster
By enabling Enhanced VPC routing on the Amazon Redshift cluster
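For reference, these are the kinds of COPY and UNLOAD statements whose S3 traffic is forced through your VPC (and therefore captured by VPC flow logs) once enhanced VPC routing is enabled; the table names, bucket paths, and IAM role ARN are hypothetical.

```sql
-- Load data from S3 into the cluster.
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- Export query results from the cluster back to S3.
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
```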
You are working for a data warehousing company that has large datasets (20 TB of structured data and 20 TB of unstructured data). They are planning to host this data in AWS, with the unstructured data stored on S3. At first they plan simply to migrate the data to AWS and use it for basic analytics, and they are not worried about performance. Which of the following options fulfills their requirement?
- node type ds2.xlarge
- node type ds2.8xlarge
- node type dc2.8xlarge
- node type dc2.xlarge
node type ds2.xlarge (Since they are not worried about performance, storage (ds) is more important than computing power (dc), and expensive 8xlarge instances aren't necessary.)
Which of the following services allows you to directly run SQL queries against exabytes of unstructured data in Amazon S3?
- Athena
- Redshift Spectrum
- Elasticache
- RDS
Redshift Spectrum
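A hedged sketch of how Redshift Spectrum queries S3 data in place through an external schema backed by the Glue Data Catalog; the schema, database, table, and role names are hypothetical.

```sql
-- Map a Glue Data Catalog database into the cluster as an external schema.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'clickstream_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query the S3-backed table as if it were local; the data is never loaded
-- into the Redshift cluster itself.
SELECT event_type, COUNT(*)
FROM spectrum.click_events
WHERE event_date = '2023-01-01'
GROUP BY event_type;
```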
How many concurrent queries can you run on a Redshift cluster?
50
What are some of the benefits and use cases of columnar databases? (Choose 2)
- They’re ideal for ‘needle in a haystack’ queries.
- Compression, as it helps with performance and provides a lower total cost of ownership.
- They’re ideal for small amounts of data.
- They store binary objects quite well.
- They are ideal for Online Analytical Processing (OLAP).
- Compression, as it helps with performance and provides a lower total cost of ownership.
- They are ideal for Online Analytical Processing (OLAP).
(Compression algorithms supported in Redshift help with performance and also help reduce the amount of data stored in a Redshift cluster, which helps lower the total cost of ownership.)
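As a small illustration of column-level compression in Redshift, the sketch below assigns an encoding to each column; the table, columns, and encoding choices are hypothetical (Redshift can also choose encodings automatically).

```sql
CREATE TABLE page_views (
    view_id   BIGINT        ENCODE az64,   -- numeric/timestamp data compresses well with AZ64
    url       VARCHAR(2048) ENCODE zstd,   -- general-purpose compression for text
    viewed_at TIMESTAMP     ENCODE az64
);
```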
In your current data warehouse, BI analysts consistently join two tables: the customer table and the orders table. The column they JOIN on (and common to both tables) is called customer_id. Both tables are very large, over 1 billion rows. Besides being in charge of migrating the data, you are also responsible for designing the tables in Redshift. Which distribution style would you choose to achieve the best performance when the BI analysts run queries that JOIN the customer table and orders table using customer_id?
Key
(The KEY distribution style will help achieve the best performance in this case. In Redshift, rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns, so that matching values from the common columns are physically stored together.)
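A minimal sketch of the two table definitions under a KEY distribution style; the column lists are abbreviated and hypothetical.

```sql
-- Distributing both tables on customer_id collocates matching rows on the same
-- slices, so the JOIN avoids redistributing a billion rows at query time.
CREATE TABLE customer (
    customer_id BIGINT NOT NULL,
    name        VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id);
```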
What is the most effective way to merge data into an existing table?
- Execute an UPSERT.
- Use a staging table to replace existing rows or update specific rows.
- UNLOAD data from Redshift into S3, use EMR to ‘merge’ new data files with the unloaded data files, and copy the data into Redshift.
- Connect the source table and the target Redshift table via a replication tool and run direct INSERTS, UPDATES into the target Redshift table.
Use a staging table to replace existing rows or update specific rows.
(You can efficiently update and insert new data by loading your data into a staging table first. Redshift does not support an UPSERT. To merge data into an existing table, you can perform a merge operation by loading your data into a staging table and then joining the staging table with your target table for an UPDATE statement and an INSERT statement.)
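A sketch of the staging-table merge pattern described above; the table names, columns, bucket path, and role ARN are hypothetical.

```sql
BEGIN TRANSACTION;

-- 1. Load the incoming data into a staging table.
COPY orders_staging
FROM 's3://my-bucket/incoming/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV;

-- 2. Update rows that already exist in the target.
UPDATE orders
SET    amount = s.amount,
       status = s.status
FROM   orders_staging s
WHERE  orders.order_id = s.order_id;

-- 3. Insert rows that are new to the target.
INSERT INTO orders
SELECT s.*
FROM   orders_staging s
LEFT JOIN orders o ON o.order_id = s.order_id
WHERE  o.order_id IS NULL;

END TRANSACTION;
```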