Book - Chapter 3 Flashcards

(12 cards)

1
Q

A large enterprise using GCP has recently acquired a startup that has an IoT platform. The acquiring company wants to migrate the IoT platform from an on-premises data center to GCP and wants to use Google Cloud managed services whenever possible. What GCP service would you recommend for ingesting IoT data?

A. Cloud Storage
B. Cloud SQL
C. Cloud Pub/Sub
D. BigQuery streaming inserts

A

C. The correct answer is C, Cloud Pub/Sub, which is a scalable, managed messaging queue that is typically used for ingesting high-volume streaming data.

Option A is incorrect; Cloud Storage does not support streaming inserts, whereas Cloud Pub/Sub is designed to scale for high-volume writes and has other features useful for stream processing, such as message acknowledgment.

Option B is incorrect; Cloud SQL is not designed to support high volumes of low-latency writes like the kind needed in IoT applications.

Option D is incorrect; although BigQuery has streaming inserts, the database is designed for analytic operations.
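
For context, a minimal sketch of ingesting through Cloud Pub/Sub with the google-cloud-pubsub Python client; the project ID, topic name, and device_id attribute are placeholders for this example.

    from google.cloud import pubsub_v1

    # Placeholder project and topic names.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "iot-telemetry")

    # Each sensor reading is published as a message; attributes such as
    # device_id carry metadata alongside the payload.
    future = publisher.publish(
        topic_path,
        data=b'{"temperature": 21.5}',
        device_id="sensor-42",
    )
    print(future.result())  # blocks until the server returns a message ID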

2
Q

You are designing a data pipeline to populate a sales data mart. The sponsor of the project has had quality control problems in the past and has defined a set of rules for filtering out bad data before it gets into the data mart. At what stage of the data pipeline would you implement those rules?

A. Ingestion
B. Transformation
C. Storage
D. Analysis

A

B. The correct answer is B. The transformation stage is where business logic and filters are
applied.

Option A is incorrect; ingestion is when data is brought into the GCP environment.

Option C is incorrect; data should be processed and problematic data removed before it is stored.

Option D is incorrect; by the analysis stage, data should be fully transformed and available for analysis.
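
To make the idea concrete, a sketch of a transformation-stage filter using the Apache Beam Python SDK; the quality rule (no missing or negative quantities) and the field names are invented for this example.

    import apache_beam as beam

    def is_valid(record):
        # Hypothetical quality-control rule: reject rows with missing
        # or negative quantities before they reach the data mart.
        return record.get("quantity") is not None and record["quantity"] >= 0

    with beam.Pipeline() as p:
        (p
         | "Read" >> beam.Create([{"sku": "A1", "quantity": 3},
                                  {"sku": "B2", "quantity": -1}])
         | "FilterBadRows" >> beam.Filter(is_valid)
         | "Print" >> beam.Map(print))  # only the valid row is printed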

3
Q

A team of data warehouse developers is migrating a set of legacy Python scripts that have
been used to transform data as part of an ETL process. They would like to use a service
that allows them to use Python and requires minimal administration and operations support.
Which GCP service would you recommend?

A. Cloud Dataproc
B. Cloud Dataflow
C. Cloud Spanner
D. Cloud Dataprep

A

B. The correct answer is B. Cloud Dataflow supports Python and is a serverless platform.

Option A is incorrect because, although Cloud Dataproc supports Python, you have to create and configure clusters.

Option C is incorrect; Cloud Spanner is a horizontally scalable global relational database.

Option D is incorrect; Cloud Dataprep is an interactive tool for preparing data for analysis.
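
For illustration, a minimal sketch of running a Beam Python pipeline on the Dataflow runner; the project, region, and staging bucket are placeholders.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project, region, and staging bucket.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    # No clusters to manage: Dataflow provisions and scales the workers.
    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create(["a", "b", "a"])
         | beam.combiners.Count.PerElement()
         | beam.Map(print))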

4
Q

You are using Cloud Pub/Sub to buffer records from an application that generates a stream
of data based on user interactions with a website. The messages are read by another service
that transforms the data and sends it to a machine learning model that will use it for training.
A developer has just released some new code, and you notice that messages are sent
repeatedly at 10-minute intervals. What might be the cause of this problem?

A. The new code release changed the subscription ID.
B. The new code release changed the topic ID.
C. The new code disabled acknowledgments from the consumer.
D. The new code changed the subscription from pull to push.

A

C. The correct answer is C. If the consumer no longer acknowledges messages, Cloud Pub/Sub redelivers them after the acknowledgment deadline expires, so the same messages arrive again and again at a fixed interval; 10 minutes (600 seconds) is the maximum acknowledgment deadline.

Option A is incorrect; changing the subscription ID would not cause messages to be redelivered repeatedly.

Option B is incorrect; changing the topic ID would not cause repeated redelivery either.

Option D is incorrect; switching from pull to push changes how messages are delivered, not whether they are redelivered.
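
To illustrate the role of acknowledgments, a sketch of a pull subscriber using the google-cloud-pubsub Python client; names are placeholders. If the message.ack() call were removed, Cloud Pub/Sub would redeliver every message once the acknowledgment deadline expired.

    from concurrent.futures import TimeoutError
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "training-sub")

    def callback(message):
        # Hand the data to the transformation step here.
        print(message.data)
        message.ack()  # without this ack, the message is redelivered
                       # after the acknowledgment deadline expires

    streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
    try:
        streaming_pull.result(timeout=60)
    except TimeoutError:
        streaming_pull.cancel()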

5
Q

It is considered a good practice to make your processing logic idempotent when consuming
messages from a Cloud Pub/Sub topic. Why is that?

A. Messages may be delivered multiple times.
B. Messages may be received out of order.
C. Messages may be delivered out of order.
D. A consumer service may need to wait extended periods of time between the delivery of
messages.

A

A. The correct answer is A; messages may be delivered multiple times and therefore processed multiple times. If the logic were not idempotent, repeated processing could leave the application in an incorrect state, as could occur if you counted the same message multiple times.

Options B and C are incorrect; the order of delivery does not require idempotent
operations.

Option D is incorrect; the time between messages is not a factor in requiring
logic to be idempotent.
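
A common way to achieve idempotence is to track the IDs of messages already processed and skip duplicates; a toy sketch follows. A real pipeline would keep the seen-ID set in a durable store rather than in memory.

    processed_ids = set()  # in practice, a durable store such as a database
    counter = 0

    def handle(message_id, value):
        global counter
        if message_id in processed_ids:
            return  # duplicate delivery; safe to ignore
        processed_ids.add(message_id)
        counter += value

    # The same message delivered twice changes the count only once.
    handle("msg-1", 10)
    handle("msg-1", 10)
    assert counter == 10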

6
Q

A group of IoT sensors is sending streaming data to a Cloud Pub/Sub topic. A Cloud Dataflow
job pulls messages from the topic and reorders the messages by event time.
A message is expected from each sensor every minute. If a message is not received from a
sensor, the stream processing application should use the average of the values in the last
four messages. What kind of window would you use to implement the missing-data logic?

A. Sliding window
B. Tumbling window
C. Extrapolation window
D. Crossover window

A

A. The correct answer is A; a sliding window would have the data for the past four
minutes.

Option B is incorrect because tumbling windows do not overlap; since the requirement calls for using the last four messages, the window must slide.

Options C and D are incorrect; they are not actually names of window types.
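
A sketch of declaring such a window in the Beam Python SDK: four-minute windows that advance every minute, so each window always holds the last four one-minute readings. The sample timestamps and values are invented.

    import apache_beam as beam
    from apache_beam.transforms.window import SlidingWindows, TimestampedValue

    with beam.Pipeline() as p:
        (p
         | beam.Create([(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0)])
         | beam.Map(lambda t: TimestampedValue(t[1], t[0]))
         # Windows are 4 minutes long and a new one starts every minute,
         # so consecutive windows overlap by 3 minutes.
         | beam.WindowInto(SlidingWindows(size=4 * 60, period=60))
         | beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
         | beam.Map(print))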

7
Q

Your department is migrating some stream processing to GCP and keeping some on premises. You are tasked with designing a way to share data from on-premises pipelines that use Kafka with GCP data pipelines that use Cloud Pub/Sub. How would you do that?

A. Use CloudPubSubConnector and Kafka Connect
B. Stream data to a Cloud Storage bucket and read from there
C. Write a service to read from Kafka and write to Cloud Pub/Sub
D. Use Cloud Pub/Sub Import Service

A

A. The correct answer is A; you should use CloudPubSubConnector and Kafka Connect.
The connector is developed and maintained by the Cloud Pub/Sub team for this purpose.

Option B is incorrect because it is a less direct and less efficient method.

Option C is incorrect because it requires writing and maintaining a custom service.

Option D is incorrect because there is no such service.
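
As a sketch of the idea, the connector is typically registered through the Kafka Connect REST API. The endpoint, connector class, and property names below are assumptions based on the open source CloudPubSubConnector and should be verified against its documentation.

    import json
    import requests

    # Hypothetical Kafka Connect host and connector settings.
    config = {
        "name": "pubsub-sink",
        "config": {
            "connector.class":
                "com.google.pubsub.kafka.sink.CloudPubSubSinkConnector",
            "topics": "clickstream",      # Kafka topic to read from
            "cps.project": "my-project",  # Cloud Pub/Sub project
            "cps.topic": "clickstream",   # Cloud Pub/Sub topic to write to
        },
    }

    resp = requests.post(
        "http://connect-host:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(config),
    )
    resp.raise_for_status()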

8
Q

A team of developers wants to create standardized patterns for processing IoT data. Several
teams will use these patterns. The developers would like to support collaboration and facilitate the use of patterns for building streaming data pipelines. What component should they use?

A. Cloud Dataflow Python Scripts
B. Cloud Dataproc PySpark jobs
C. Cloud Dataflow templates
D. Cloud Dataproc templates

A

C. The correct answer is C. Use Cloud Dataflow templates to specify the pattern and
provide parameters for users to customize the template.

Option A is incorrect since this would require users to customize the code in the script.

Options B and D are incorrect because Cloud Dataproc should not be used for this requirement; moreover, there are no Cloud Dataproc templates.
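
A sketch of how a templated Beam Python pipeline exposes parameters: runtime values are declared as value providers so that teams launching the template can supply their own settings without touching the code. The option names here are illustrative.

    from apache_beam.options.pipeline_options import PipelineOptions

    class IoTPatternOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Value providers defer these settings to template launch
            # time instead of template build time.
            parser.add_value_provider_argument(
                "--input_subscription", type=str,
                help="Cloud Pub/Sub subscription to read from")
            parser.add_value_provider_argument(
                "--output_table", type=str,
                help="BigQuery table for the transformed records")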

9
Q

You need to run several MapReduce jobs on Hadoop along with one Pig job and four PySpark jobs. When you ran the jobs on premises, you used the department’s Hadoop cluster. Now you are running the jobs in GCP. What configuration for running these jobs would you recommend?

A. Create a single cluster and deploy Pig and Spark in the cluster.
B. Create one persistent cluster for the Hadoop jobs, one for the Pig job and one for the
PySpark jobs.
C. Create one cluster for each job, and keep the cluster running continuously so that you
do not need to start a new cluster for each job.
D. Create one cluster for each job and shut down the cluster when the job completes.

A

D. The correct answer is D. You should create an ephemeral cluster for each job and delete the cluster after the job completes.

Option A is incorrect because that is a more complicated configuration.

Option B is incorrect because it keeps the clusters running instead of shutting them down after the jobs complete.

Option C is incorrect because it keeps the clusters running after the jobs complete.
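
A sketch of the ephemeral-cluster pattern with the google-cloud-dataproc Python client: create a cluster, run the job, delete the cluster. The project, region, cluster settings, and job file are placeholders, and error handling is omitted.

    from google.cloud import dataproc_v1

    project, region, name = "my-project", "us-central1", "ephemeral-pig"
    endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

    clusters = dataproc_v1.ClusterControllerClient(client_options=endpoint)
    jobs = dataproc_v1.JobControllerClient(client_options=endpoint)

    # 1. Create a cluster just for this job (default configuration here;
    #    a real cluster spec would set machine types, worker counts, etc.).
    clusters.create_cluster(request={
        "project_id": project, "region": region,
        "cluster": {"project_id": project, "cluster_name": name},
    }).result()

    # 2. Submit the job (a Pig job here; Hadoop and PySpark jobs are similar).
    jobs.submit_job_as_operation(request={
        "project_id": project, "region": region,
        "job": {"placement": {"cluster_name": name},
                "pig_job": {"query_file_uri": "gs://my-bucket/job.pig"}},
    }).result()

    # 3. Delete the cluster as soon as the job completes.
    clusters.delete_cluster(request={
        "project_id": project, "region": region, "cluster_name": name,
    }).result()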

10
Q

You are working with a group of genetics researchers analyzing data generated by gene
sequencers. The data is stored in Cloud Storage. The analysis requires running a series of
six programs, each of which will output data that is used by the next process in the pipeline.
The final result set is loaded into BigQuery. What tool would you recommend for
orchestrating this workflow?

A. Cloud Composer
B. Cloud Dataflow
C. Apache Flink
D. Cloud Dataproc

A

A. The correct answer is A, Cloud Composer, which is designed to support workflow
orchestration.

Options B and C are incorrect because they are both implementations of the Apache Beam model, which is used for executing stream and batch processing programs, not for orchestrating workflows.

Option D is incorrect; Cloud Dataproc is a managed Hadoop and Spark service.
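
A sketch of the workflow as a Cloud Composer (Apache Airflow) DAG; only three of the six steps are shown, and the task commands are placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("genomics_pipeline",
             start_date=datetime(2024, 1, 1),
             schedule_interval=None) as dag:
        # Each step consumes the previous step's output; the echo
        # commands stand in for the real analysis programs.
        step1 = BashOperator(task_id="align", bash_command="echo align")
        step2 = BashOperator(task_id="call_variants", bash_command="echo call")
        load = BashOperator(task_id="load_bigquery", bash_command="echo load")

        # The >> operator expresses the ordering dependencies.
        step1 >> step2 >> load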

11
Q

An on-premises data warehouse is currently deployed using HBase on Hadoop. You want
to migrate the database to GCP. You could continue to run HBase within a Cloud Dataproc
cluster, but what other option would help ensure consistent performance and support the
HBase API?

A. Store the data in Cloud Storage
B. Store the data in Cloud Bigtable
C. Store the data in Cloud Datastore
D. Store the data in Cloud Dataflow

A

B. The correct answer is B. The data could be stored in Cloud Bigtable, which provides
consistent, scalable performance.

Option A is incorrect because Cloud Storage is an object storage system, not a database.

Option C is incorrect, since Cloud Datastore is a document-style NoSQL database and is not suitable for a data warehouse.

Option D is incorrect; Cloud Dataflow is not a database.
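
Cloud Bigtable's HBase API support comes through the Java-based HBase client library; for consistency with the other examples here, a sketch of the same row/column-family data model using the google-cloud-bigtable Python client. All names are placeholders.

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("warehouse").table("events")

    # Write one cell: Bigtable rows are keyed and grouped into column
    # families, the same data model the HBase API exposes.
    row = table.direct_row(b"device#42#2024-01-01")
    row.set_cell("metrics", "temp", b"21.5")
    row.commit()

    print(table.read_row(b"device#42#2024-01-01"))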

12
Q

The business owners of a data warehouse have determined that the current design of the
data warehouse is not meeting their needs. In addition to having data about the state of
systems at certain points in time, they need to know about all the times that data changed
between those points in time. What kind of data warehousing pipeline should be used to
meet this new requirement?

A. ETL
B. ELT
C. Extraction and load
D. Change data capture

A

D. The correct answer is D. With change data capture, each change in a source system is captured and recorded in a data store.

Options A, B, and C all capture the state of source systems at a point in time and do not capture changes between those times.
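
To illustrate the difference, a toy sketch in Python: a periodic snapshot records state only at capture time, while a change data capture log records every change, so any intermediate state can be reconstructed. The records are invented for the example.

    # Each CDC record notes what changed and when.
    changes = [
        {"ts": 1, "key": "order-9", "value": "placed"},
        {"ts": 5, "key": "order-9", "value": "shipped"},
        {"ts": 9, "key": "order-9", "value": "delivered"},
    ]

    def state_at(ts):
        """Replay the change log to reconstruct state at any point in time."""
        state = {}
        for c in changes:
            if c["ts"] <= ts:
                state[c["key"]] = c["value"]
        return state

    # A snapshot taken at ts=9 shows only "delivered"; the CDC log also
    # preserves the intermediate "shipped" state at ts=5.
    print(state_at(5))  # {'order-9': 'shipped'}
    print(state_at(9))  # {'order-9': 'delivered'}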
