Big Data Solutions Flashcards
What is Cloud Pub/Sub?
A fully managed messaging and event service used in data pipelines
Which services align with the Cloud Dataflow pipeline?
Cloud Dataflow is based on Apache Beam
Which services align with the Cloud Dataproc pipeline?
Cloud Dataproc is used for Apache Spark and Hadoop clusters
What is BigQuery?
BigQuery is a fully managed analytics service used to help analyze large amounts of data
What language are BigQuery queries written in?
SQL
What services can Cloud Pub/Sub integrate with?
Cloud Logs, Cloud API, Cloud Dataflow, Cloud Storage, and Compute Engine
What is the primary difference between Cloud Dataflow and Cloud Dataproc?
You must provision your own servers (a cluster) in Cloud Dataproc, whereas Cloud Dataflow is serverless
What types of instances are available for Cloud Dataproc jobs?
Compute Engine instances, preemptible instances
What is Cloud IoT Core?
Cloud IoT Core is a fully managed Google service that offers secure connections, management, and ingestion of data from IoT devices
What type of Pub/Sub protocol does Cloud IoT Core use?
It can use both MQTT and HTTP, although MQTT, a publish/subscribe protocol, is typically the more effective of the two
What can you do with Cloud IoT Core?
You can register, configure, update, and control IoT devices
How much data can be loaded into BigQuery?
BigQuery scales to petabytes of data, although every table must belong to a dataset, so at least one dataset must exist
What is a publisher?
An application that can create and send messages to a topic
What is a topic?
A topic is a resource to which messages are sent by publishers
What is a message?
Data a publisher will send to a topic (data in transit)
What is a subscription?
A subscription represents the stream of messages from a single topic to be delivered to the subscribing application
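The four terms above (publisher, topic, message, subscription) can be pictured with a toy in-memory model. This is only a sketch for illustration: the `Topic` and `Subscription` classes and their methods are hypothetical names, not the Cloud Pub/Sub client library API, and the real service adds acknowledgement, retries, and durable storage that are left out here.

```python
from collections import deque


class Subscription:
    """Holds the stream of messages from one topic for one subscriber."""

    def __init__(self, name):
        self.name = name
        self._queue = deque()

    def deliver(self, message):
        # Called by the topic; queues the message until the subscriber pulls it.
        self._queue.append(message)

    def pull(self):
        # Return the oldest queued message, or None if nothing is waiting.
        return self._queue.popleft() if self._queue else None


class Topic:
    """A named resource that publishers send messages to."""

    def __init__(self, name):
        self.name = name
        self._subscriptions = []

    def subscribe(self, name):
        # Attach a new subscription to this (single) topic.
        subscription = Subscription(name)
        self._subscriptions.append(subscription)
        return subscription

    def publish(self, message):
        # Fan out: every subscription on the topic receives the message.
        for subscription in self._subscriptions:
            subscription.deliver(message)


# The publisher is any code that calls publish() on a topic.
topic = Topic("sensor-readings")
billing = topic.subscribe("billing")
audit = topic.subscribe("audit")
topic.publish({"device": "d1", "temp": 21.5})
```

Each subscription keeps its own queue, so `billing.pull()` and `audit.pull()` each return the published message independently of the other.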
What are common uses for Cloud Pub/Sub?
Distributed event notifications, balancing workloads, and logging
What is a data pipeline?
A pipeline is a piece of code that defines how we wish to process our data
What is Cloud Dataflow typically used for?
Cloud Dataflow ingests data from Cloud Pub/Sub and transforms it into the form we need as part of the data pipeline
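The ingest-then-transform idea can be sketched in plain Python, with generator functions standing in for pipeline stages. This is not the Apache Beam or Cloud Dataflow API; the stage names (`parse`, `keep_valid`, `to_fahrenheit`) and the JSON message shape are assumptions made up for the example.

```python
import json


def parse(raw_messages):
    # Decode each raw message (a JSON string) into a dict.
    for raw in raw_messages:
        yield json.loads(raw)


def keep_valid(records):
    # Drop records that lack a temperature reading.
    for record in records:
        if "temp_c" in record:
            yield record


def to_fahrenheit(records):
    # Transform each record into the shape the next consumer needs.
    for record in records:
        yield {"device": record["device"], "temp_f": record["temp_c"] * 9 / 5 + 32}


def run_pipeline(raw_messages):
    # The pipeline is the ordered composition of its stages.
    return list(to_fahrenheit(keep_valid(parse(raw_messages))))


# Hypothetical messages, as they might arrive from a subscription.
incoming = [
    '{"device": "d1", "temp_c": 20}',
    '{"device": "d2"}',
    '{"device": "d3", "temp_c": 100}',
]
results = run_pipeline(incoming)
```

Here `results` holds two records (for `d1` and `d3`); the `d2` message is filtered out because it carries no reading.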
What should you do if data being uploaded into BigQuery is only being used temporarily?
Tables in BigQuery can be given a table expiration when the dataset is being created, which cuts down on storage costs
How long must data be in BigQuery for its storage price to drop?
Data must remain in BigQuery unedited for 90 days before its storage cost drops by 50%
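As a rough worked example of the 90-day rule, the sketch below estimates a table's monthly storage cost. The function name, the per-GB price used in the example, and the choice to treat exactly 90 days as qualifying are all assumptions for illustration; actual prices and rules should be checked against current BigQuery pricing.

```python
LONG_TERM_THRESHOLD_DAYS = 90   # days unedited before the lower price applies
LONG_TERM_DISCOUNT = 0.5        # storage cost drops by 50%


def monthly_storage_cost(size_gb, days_since_last_edit, price_per_gb_month):
    """Estimate one month's storage cost for a table.

    price_per_gb_month is a hypothetical input, not a quoted price.
    """
    cost = size_gb * price_per_gb_month
    if days_since_last_edit >= LONG_TERM_THRESHOLD_DAYS:
        # Assumption: exactly 90 days already counts as long-term.
        cost *= LONG_TERM_DISCOUNT
    return cost


# Example with a made-up price of 0.02 per GB per month.
active = monthly_storage_cost(1000, 10, 0.02)
long_term = monthly_storage_cost(1000, 120, 0.02)
```

With these made-up numbers, the table edited 10 days ago costs twice as much per month as the same table left unedited for 120 days.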
What is an advantage of using Cloud Dataproc clusters?
Clusters are used only for a job’s lifetime and are therefore cost-effective
What types of machines are available in a Cloud Dataproc cluster?
Master nodes, worker nodes, and preemptible worker nodes
What type of cluster configuration in Cloud Dataproc has one master node and N worker nodes?
Standard. If there is a compute failure, in-flight jobs will fail and the file system will be inaccessible until the master node reboots