Big Data Solutions Flashcards

1
Q

What is Cloud Pub/Sub?

A

A fully managed messaging service for event ingestion and delivery, commonly used to feed data pipelines

2
Q

Which services align with the Cloud Dataflow pipeline?

A

Cloud Dataflow is based on Apache Beam

3
Q

Which services align with the Cloud Dataproc pipeline?

A

Cloud Dataproc is used to run Apache Spark and Apache Hadoop clusters

4
Q

What is BigQuery?

A

BigQuery is a fully managed analytics service used to analyze large amounts of data

5
Q

In what language are BigQuery queries written?

A

SQL

6
Q

What services can Cloud Pub/Sub integrate with?

A

Cloud Logs, Cloud API, Cloud Dataflow, Cloud Storage, and Compute Engine

7
Q

What is the primary difference between Cloud Dataflow and Cloud Dataproc?

A

Cloud Dataflow is serverless, whereas in Cloud Dataproc you must provision your own cluster of servers

8
Q

What types of instances are available for Cloud Dataproc jobs?

A

Compute Engine instances, preemptible instances

9
Q

What is Cloud IoT Core?

A

Cloud IoT Core is a fully managed Google service that offers secure connections, management, and ingestion of data from IoT devices

10
Q

What type of Pub/Sub protocol does Cloud IoT Core use?

A

It supports both MQTT and HTTP; MQTT, a publish/subscribe protocol, is typically the more effective of the two

11
Q

What can you do with Cloud IoT Core?

A

You can register, configure, update, and control IoT devices

12
Q

How much data can be loaded into BigQuery?

A

BigQuery scales to petabytes of data. Tables are loaded into datasets, so at least one dataset must exist

13
Q

What is a publisher?

A

An application that can create and send messages to a topic

14
Q

What is a topic?

A

A topic is a resource to which messages are sent by publishers

15
Q

What is a message?

A

Data a publisher will send to a topic (data in transit)

16
Q

What is a subscription?

A

A subscription represents the stream of messages from a single topic to be delivered to the subscribing application
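
The relationship between publisher, topic, message, and subscription can be sketched with a small in-memory model. This is only an illustration of the concepts: the `Topic` class, `pull` function, and all names such as `sensor-events` are hypothetical and are not the Cloud Pub/Sub client API.

```python
from collections import deque


class Topic:
    """Hypothetical in-memory stand-in for a topic (not the Cloud Pub/Sub API)."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}

    def subscribe(self, subscription_name):
        # Each subscription gets its own queue, so every subscription
        # receives its own copy of each message published afterwards.
        queue = deque()
        self.subscriptions[subscription_name] = queue
        return queue

    def publish(self, message):
        # The publisher sends a message to the topic; the topic fans it
        # out to all attached subscriptions.
        for queue in self.subscriptions.values():
            queue.append(message)


def pull(subscription):
    """Deliver (and remove) the oldest pending message, or None if empty."""
    return subscription.popleft() if subscription else None


topic = Topic("sensor-events")
billing = topic.subscribe("billing-sub")
audit = topic.subscribe("audit-sub")
topic.publish({"device": "d1", "temp": 21.5})
first_for_billing = pull(billing)
first_for_audit = pull(audit)
```

Both subscriptions receive the same message independently, which is the property that makes a topic useful for distributing one event to several applications.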

17
Q

What are common uses for Cloud Pub/Sub?

A

Distributed event notifications, balancing workloads, and logging

18
Q

What is a data pipeline?

A

A pipeline is a piece of code that determines how we wish to process our data
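
As a rough illustration of "code that determines how we process our data", the sketch below chains three steps in plain Python. It does not use Apache Beam or Cloud Dataflow; the function names, the input format, and the temperature thresholds are all assumptions made for the example.

```python
def parse(line):
    # Turn a raw "device,temperature" line into a structured record.
    device, temp = line.split(",")
    return {"device": device, "temp_c": float(temp)}


def is_valid(record):
    # Drop readings outside a plausible range (thresholds are illustrative).
    return -50.0 <= record["temp_c"] <= 60.0


def to_fahrenheit(record):
    # Add a derived field without mutating the input record.
    return {**record, "temp_f": record["temp_c"] * 9 / 5 + 32}


def run_pipeline(lines):
    # The pipeline itself: an ordered description of how data is processed.
    records = map(parse, lines)
    valid = filter(is_valid, records)
    return [to_fahrenheit(r) for r in valid]


result = run_pipeline(["d1,20.0", "d2,999.0", "d3,-40.0"])
```

Here the reading from `d2` is filtered out, and the remaining two records each gain a `temp_f` field.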

19
Q

What is Cloud Dataflow typically used for?

A

Cloud Dataflow ingests data from Cloud Pub/Sub and transforms it into the data that we need to use as a part of the data pipeline

20
Q

What should you do if data being uploaded into BigQuery is only being used temporarily?

A

A table expiration can be set when the dataset is created, so that temporary tables in BigQuery are deleted automatically and storage costs are cut down

21
Q

How long must data be in BigQuery for its storage price to drop?

A

Data must sit unedited in BigQuery for 90 days before its storage cost drops by about 50%
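
The 90-day rule can be worked through as simple arithmetic. In this sketch the per-GB price is a placeholder chosen for the example, not a quoted BigQuery price, and the function name is hypothetical.

```python
def monthly_storage_cost(gb, days_unmodified, price_per_gb=0.02):
    # price_per_gb is a placeholder, not a quoted BigQuery price.
    # After 90 days without edits, the rule from the card halves the rate.
    rate = price_per_gb * 0.5 if days_unmodified >= 90 else price_per_gb
    return gb * rate


active = monthly_storage_cost(1000, days_unmodified=10)
long_term = monthly_storage_cost(1000, days_unmodified=120)
```

Under these assumptions, the same 1000 GB costs half as much per month once it has gone 90 days without being edited.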

22
Q

What is an advantage of using Cloud Dataproc clusters?

A

Clusters need to exist only for the lifetime of a job and are therefore cost-effective

23
Q

What types of machines are available in a Cloud Dataproc cluster?

A

Master nodes, worker nodes, and preemptible worker nodes

24
Q

What type of cluster configuration in Cloud Dataproc has one master node and N worker nodes?

A

Standard; if there is a Compute Engine failure, in-flight jobs will fail and the file system will be inaccessible until the master node reboots

25
Q

What type of cluster configuration in Cloud Dataproc includes 3 master nodes and N worker nodes?

A

High Availability; designed to allow uninterrupted operations in the event of a Compute Engine failure

26
Q

What type of cluster configuration for Cloud Dataproc combines both master and worker nodes?

A

Single Node; not suitable for large data processing and should be used for PoC or small-scale, non-critical data processing
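
The three cluster configurations from the cards above differ only in how many master and worker nodes they contain, which the following sketch summarizes. The mode strings and the `node_layout` function are hypothetical names for illustration, not Cloud Dataproc settings.

```python
def node_layout(mode, workers=2):
    """Return (master_count, worker_count) for a cluster mode from the cards."""
    if mode == "single_node":
        # One machine plays both roles, so there are no separate workers.
        return (1, 0)
    if mode == "standard":
        # One master: a failure interrupts in-flight jobs until it reboots.
        return (1, workers)
    if mode == "high_availability":
        # Three masters: designed to keep operating through a node failure.
        return (3, workers)
    raise ValueError(f"unknown mode: {mode}")


layouts = {m: node_layout(m, workers=4)
           for m in ("single_node", "standard", "high_availability")}
```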

27
Q

What components from the Apache Hadoop ecosystem are automatically installed on a Cloud Dataproc cluster?

A

Apache Spark, Apache Hadoop, Apache Pig, Apache Hive, Python, Java, and the Hadoop Distributed File System

28
Q

What is MQTT?

A

MQTT is a Publish/Subscribe protocol that is often used with devices because it is data focused (often considered better for IoT jobs)

29
Q

What is HTTP in relation to Pub/Sub?

A

HTTP is a connectionless protocol: devices do not maintain a connection to Cloud IoT Core but instead send requests and receive responses. It is considered document focused.

30
Q

How do protocols communicate with Cloud IoT Core?

A

They communicate via a protocol bridge, which provides MQTT and HTTP protocol endpoints, automatic load balancing, and global data access via Pub/Sub