Exam Preparation Flashcards

1
Q

What are the four stages of the data lifecycle?

A

Ingest, store, process and analyze, and explore and visualize.

2
Q

What is streaming data?

A

Streaming data is a set of data that is sent in small messages that are transmitted continuously from the data source. Streaming data may be telemetry data, which is data generated at regular intervals, and event data, which is data generated in response to a particular event. Stream ingestion services need to deal with potentially late and missing data. Streaming data is often ingested using Cloud Pub/Sub.

3
Q

What is bulk data?

A

Batch data is ingested in bulk, typically in files. Examples of batch data ingestion include uploading files of data exported from one application to be processed by another. Both batch and streaming data can be transformed and processed using Cloud Dataflow.

4
Q

What technical factors should be considered when choosing a data store?

A

These factors include the volume and velocity of the data, the structure of the data, access control requirements, and data access patterns.

5
Q

What are the three levels of data structure?

A

Unstructured, semi-structured and structured.

6
Q

What products store structured data in GCP?

A

CloudSQL and CloudSpanner for transactional

BigQuery for analytical

7
Q

What products store semi-structured data in GCP?

A

If data access requires full indexing, use Cloud Datastore; otherwise, use Bigtable.

8
Q

What products store unstructured data in GCP?

A

Cloud Storage

9
Q

What are the four types of NoSQL databases?

A

Four types of NoSQL databases are key-value, document, wide-column, and graph databases

10
Q

What are some concerns about streaming data?

A

Stream ingestion services need to deal with potentially late and missing data

11
Q

What tool can transform batch and streaming data?

A

Both batch and streaming data can be transformed and processed using Cloud Dataflow.

12
Q

What SQL does CloudSQL support?

A

Cloud SQL supports MySQL, PostgreSQL, and SQL Server (beta).

13
Q

How is Cloud SQL initially set up for availability?

A

Cloud SQL instances are created in a single zone by default, but they can be created for high availability and use instances in multiple zones

14
Q

How can you improve reads in CloudSQL?

A

Use read replicas.

15
Q

What is CloudSpanner?

A

Cloud Spanner is a horizontally scalable relational database that automatically replicates data

16
Q

What are the three types of replicas in CloudSpanner?

A

Three types of replicas are read-write replicas, read-only replicas, and witness replicas.

17
Q

How can you avoid hot-spotting in CloudSpanner?

A

Avoid hotspots by not using consecutive values for primary keys.

18
Q

What kind of configuration does CloudSpanner have?

A

Cloud Spanner is configured as regional or multi-regional instances

19
Q

What is BigTable?

A

Cloud Bigtable is a wide-column NoSQL database used for high-volume databases that require sub-10 ms latency (fast write).

20
Q

What use cases are there for Bigtable?

A

Cloud Bigtable is used for IoT, time-series, finance, and similar applications.

21
Q

How do you make BigTable highly available?

A

For multi-regional high availability, you can create a replicated cluster in another region. All data is replicated between clusters.

22
Q

How is data stored in Bigtable?

A

Data is stored in Bigtable lexicographically by row-key, which is the one indexed column in a Bigtable table.

23
Q

How do you improve reads in BigTable?

A

Keeping related data in adjacent rows can help make reads more efficient.

24
Q

What is Cloud Firestore?

A

Cloud Firestore is a document database that is replacing Cloud Datastore as the managed document database.

25
Q

What is BigQuery?

A

BigQuery is an analytics database that uses SQL as a query language

26
Q

What are datasets in BigQuery?

A

Datasets are the basic unit of organization for sharing data in BigQuery. A dataset can have multiple tables.

27
Q

What SQL does BigQuery support?

A

BigQuery supports two dialects of SQL: legacy and standard.

28
Q

What are streaming inserts in BigQuery?

A

Streaming inserts allow adding one row at a time.

29
Q

What does Stackdriver do in BigQuery?

A

Stackdriver is used for monitoring and logging in BigQuery.

30
Q

How are BigQuery costs managed?

A

BigQuery costs are based on the amount of data stored, the amount of data streamed, and the workload required to execute queries.

31
Q

What is Cloud Memorystore?

A

Cloud Memorystore is a managed Redis service. Redis instances can be created using the Cloud Console or gcloud commands

32
Q

When is Cloud Memorystore under memory pressure?

A

When the memory used by Redis exceeds 80 percent of system memory, the instance is considered under memory pressure.

33
Q

What is Google Cloud Storage?

A

It’s object storage, similar to AWS S3.

34
Q

Where are objects stored in Google Cloud Storage?

A

In buckets, which share access controls at the bucket level.

35
Q

What are the four storage tiers of Google Cloud Storage?

A

The four storage tiers are Regional, Multi-regional, Nearline, and Coldline.

36
Q

What structure is used to model data pipelines?

A

Data pipelines are modeled as directed acyclic graphs (DAGs)
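
A pipeline's DAG structure can be sketched with Python's standard-library graphlib; the task names below are hypothetical stages, not GCP service calls:

```python
from graphlib import TopologicalSorter

# A toy data pipeline as a DAG: each task maps to the set of tasks
# it depends on. Task names are illustrative only.
pipeline = {
    "transform": {"ingest"},
    "store": {"transform"},
    "analyze": {"store"},
}

# A topological order is a valid execution order for the pipeline.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['ingest', 'transform', 'store', 'analyze']
```

The "acyclic" part matters: if "ingest" also depended on "analyze", `static_order()` would raise a `CycleError`, because no valid execution order exists.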

37
Q

What are the four stages of the data pipeline?

A

Ingestion - bringing data into the GCP environment. Transformation - mapping data from the structure used in the source system to the structure used in the storage and analysis stages of the data pipeline.

Storage - Cloud Storage can be used as both the staging area for storing data immediately after ingestion and as a long-term store for transformed data.

Analysis - BigQuery can treat data in Cloud Storage as external tables and query them, and Cloud Dataproc can use Cloud Storage as HDFS-compatible storage. Analysis can take several forms, from simple SQL querying and report generation to machine learning model training and data science analysis.

38
Q

What are the common patterns in data warehousing pipelines?

A

ETL (extract, transform, load)
ELT (extract, load, transform)
CDC (change data capture)
EL (extract and load)

39
Q

What are the unique considerations for streaming data?

A

Differences between event time and processing time, sliding and tumbling windows, late-arriving data and watermarks, and missing data.
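
The windowing idea can be sketched in plain Python: a minimal tumbling-window assignment by event time, assuming hypothetical epoch-second timestamps:

```python
# Assign events to 60-second tumbling windows keyed by event time
# (the time the event occurred), not processing time (when it arrived).
WINDOW = 60

def window_start(event_time: int) -> int:
    """Floor an event time to the start of its tumbling window."""
    return event_time - (event_time % WINDOW)

# (event_time, payload) pairs; the last one arrived late but is still
# placed in the window its event time belongs to.
events = [(125, "a"), (130, "b"), (190, "c"), (119, "late")]

windows = {}
for ts, payload in events:
    windows.setdefault(window_start(ts), []).append(payload)

print(windows)  # {120: ['a', 'b'], 180: ['c'], 60: ['late']}
```

A sliding window differs only in that windows overlap, so one event can belong to several windows; watermarks are the mechanism a real stream processor uses to decide when a window can be closed despite possible late data.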

40
Q

What are the components of a typical ML pipeline?

A

This includes data ingestion, data preprocessing, feature engineering, model training and evaluation, and deployment.

41
Q

What is Cloud Pub/Sub?

A

Cloud Pub/Sub is a managed message queue service.

42
Q

Does Cloud Pub/Sub scale as needed?

A

Cloud Pub/Sub will automatically scale as needed.

43
Q

When are messaging queues used?

A

Messaging queues are used in distributed systems to decouple services in a pipeline. This allows one service to produce more output than the consuming service can process without adversely affecting the consuming service. This is especially helpful when one process is subject to spikes.

i.e., lots of messages are no problem: just add them to the queue.
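
The decoupling idea can be sketched with an in-memory queue standing in for a Cloud Pub/Sub topic (illustrative only):

```python
from collections import deque

# A fast producer enqueues a burst of messages; a slower consumer
# drains them at its own pace, unaffected by the spike.
queue = deque()

# Producer spike: 5 messages arrive at once.
for i in range(5):
    queue.append(f"msg-{i}")

# Consumer works through the backlog one message at a time.
processed = []
while queue:
    processed.append(queue.popleft())

print(processed)  # ['msg-0', 'msg-1', 'msg-2', 'msg-3', 'msg-4']
```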

44
Q

What is Cloud Dataflow?

A

Cloud Dataflow is a managed stream and batch processing service.

45
Q

How does Cloud Dataflow work to help?

A

In the past, developers would typically create a stream processing pipeline (hot path) and a separate batch processing pipeline (cold path). Cloud Dataflow combines the two.

46
Q

What is Cloud Dataproc?

A

Cloud Dataproc is a managed Hadoop and Spark service.

47
Q

How does Cloud Dataproc support on-prem migrations?

A

You can move your on-prem Hadoop to Dataproc.

48
Q

What is Cloud Composer?

A

Cloud Composer is a managed service implementing Apache Airflow.

49
Q

What type of nodes does Dataproc support?

A

Cloud Dataproc clusters consist of two types of nodes: master nodes and worker nodes.

50
Q

What does Cloud Composer do?

A

Cloud Composer automates the scheduling and monitoring of workflows.

51
Q

What do you need to do when migrating from on-premises Hadoop and Spark to GCP?

A

Do it incrementally.
Migrate HBase to Bigtable.
Manage the synchronization between the on-prem and cloud environments.

52
Q

What is Compute Engine?

A

It’s like AWS EC2: you have complete control over the instances.

53
Q

What is GKE?

A

Kubernetes is a container orchestration system, and Kubernetes Engine is a managed Kubernetes service. With Kubernetes Engine, Google maintains the cluster and assumes responsibility for installing and configuring the Kubernetes platform on the cluster. Kubernetes Engine deploys Kubernetes on managed instance groups.

54
Q

What is App Engine?

A

App Engine is GCP’s original platform-as-a-service (PaaS) offering. It’s analogous to AWS Elastic Beanstalk.

55
Q

What is Cloud Functions?

A

Cloud Functions is a serverless, managed compute service for running code in response to events that occur in the cloud. Events are supported for Cloud Pub/ Sub, Cloud Storage, HTTP events, Firebase, and Stackdriver Logging.

56
Q

What is availability, reliability and scalability?

A

Availability is defined as the ability of a user to access a resource at a specific time. It is usually measured as the percentage of time a system is operational, which is the figure SLAs are written against.

Reliability is defined as the probability that a system will meet service-level objectives for some duration of time. Reliability is often measured as the mean time between failures.

Scalability is the ability of a system to meet the demands of workloads as they vary over time.

57
Q

When do you use hybrid cloud computing?

A

The analytics hybrid cloud is used when transaction processing systems continue to run on premises and data is extracted and transferred to the cloud for analytic processing.

58
Q

When is an edge cloud used?

A

A variation of hybrid clouds is an edge cloud, which uses local computation resources in addition to cloud platforms. This architecture pattern is used when a network may not be reliable or have sufficient bandwidth to transfer data to the cloud. It is also used when low-latency processing is required.

59
Q

What is messaging?

A

Message brokers are services that provide three kinds of functionality: message validation, message transformation, and routing.

Message validation is the process of ensuring that messages received are correctly formatted. Message transformation is the process of mapping data to structures that can be used by other services. Message brokers can receive a message and use data in the message to determine where the message should be sent. Routing is used when hub-and-spoke message brokers are used.

60
Q

What are the four steps for migrating a data warehouse?

A

At a high level, the process of migrating a data warehouse involves four stages:

Assessing the current state of the data warehouse
Designing the future state
Migrating data, jobs, and access controls to the cloud
Validating the cloud data warehouse

61
Q

What are distributed processing architectures?

A

SOA is a distributed architecture that is driven by business operations and delivering business value. Typically, an SOA system serves a discrete business activity. SOAs are self-contained sets of services.

Microservices are a variation on SOA architecture. Like other SOA systems, microservice architectures use multiple, independent components and common communication protocols to provide higher-level business services. Serverless functions extend the principles of microservices by removing concerns for containers and managing runtime environments.

62
Q

Does Compute Engine support single instance and instance groups?

A

Yes, Compute Engine supports provisioning single instances or groups of instances, known as instance groups.

63
Q

What are MIGs?

A

Managed instance groups (MIGs) consist of identically configured VMs

64
Q

What are the benefits of MIGs?

A

Autohealing

Support for multizone groups that provide for availability in spite of zone-level failures

Load balancing to distribute workload across all instances in the group

Autoscaling, which adds or removes instances in the group to accommodate increases and decreases in workloads

Automatic, incremental updates to reduce disruptions to workload processing

65
Q

What are containers in Kubernetes?

A

Containers are increasingly used to process workloads because they have less overhead than VMs and allow for finer-grained allocation of resources than VMs.

66
Q

What is Kubernetes Engine?

A

Kubernetes Engine is a managed Kubernetes service that provides container orchestration.

67
Q

How can Bigtable instances be provisioned?

A

Bigtable instances can be provisioned using the cloud console, the command-line SDK, and the REST API.

68
Q

Why use Bigtable?

A

Use Bigtable when you require high-volume, low-latency writes.

69
Q

What is Cloud IAM?

A

Cloud IAM provides fine-grained identity and access management for resources within GCP. Cloud IAM uses the concept of roles, which are collections of permissions that can be assigned to identities.

70
Q

What are some roles supplied by Cloud IAM?

A

Predefined roles supplied by Cloud IAM include the Owner, Editor, and Viewer roles.

71
Q

What is the principle of least privilege?

A

Users should be granted only the minimal privileges needed to perform their tasks.

72
Q

What is the purpose of service accounts?

A

Service accounts are able to make API calls authorized by roles assigned to the service account.

73
Q

How are service accounts identified?

A

A service account is identified by a unique email address. These accounts are authenticated by two sets of public/private keys.

74
Q

What is data-at-rest encryption?

A

Encryption is the process of encoding data in a way that yields a coded version of data that cannot be practically converted back to the original form without additional information.

75
Q

How is data encrypted at rest in GCP?

A

Data is encrypted at multiple levels, including the application, infrastructure, and device levels.

76
Q

What is data in transit encryption?

A

All traffic to Google Cloud services is encrypted by default.

77
Q

What is Cloud KMS? What does it do?

A

Cloud KMS is a hosted key management service in the Google Cloud. It enables customers to generate and store keys in GCP.

78
Q

What are the three dimensions of the maps in Google Bigtable?

A

The three dimensions are rows, columns, and cells.

79
Q

How would you manage row-keys in Cloud Bigtable?

A

When using a multitenant Cloud Bigtable database, it is a good practice to use a tenant prefix in the row-key.

80
Q

What’s a good candidate for a row-key?

A

String identifiers, such as a customer ID or a sensor ID, are good candidates for a row-key.

81
Q

How should we use timestamps in row-keys in Cloud Bigtable?

A

Timestamps may be used as part of a row-key, but they should not be the entire row-key or the start of the row-key.

82
Q

What is field promotion in row-keys in Cloud BigTable?

A

Putting another field at the front of a row-key, ahead of a timestamp.
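
A minimal sketch of the effect, assuming hypothetical sensor IDs and epoch-second timestamps:

```python
# Field promotion: put the sensor ID ahead of the timestamp so that
# rows cluster by sensor, and timestamps never lead the row-key
# (a timestamp-first key would hotspot on the latest rows).
readings = [
    ("sensor-2", 1700000060),
    ("sensor-1", 1700000120),
    ("sensor-1", 1700000060),
]

# Bigtable stores rows lexicographically by row-key; sorted() mimics
# the resulting on-disk order.
row_keys = sorted(f"{sensor}#{ts}" for sensor, ts in readings)
print(row_keys)
# ['sensor-1#1700000060', 'sensor-1#1700000120', 'sensor-2#1700000060']
```

With this ordering, all readings for one sensor sit in adjacent rows, so a range scan over a single sensor's time window is efficient.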

83
Q

How do we use row-keys in tall and narrow tables for time-series databases effectively?

A

Keep names short. This reduces the size of metadata since names are stored along with data values.

Design row-keys for looking up a single value or a range of values.

84
Q

When do we use interleaved tables in CloudSpanner?

A

Use interleaved tables with a parent-child relationship in which parent data is stored with child data. This makes simultaneous reads of related data more efficient.

85
Q

How do you design primary keys to avoid hotspots?

A

Using the hash of a natural key;
swapping the order of columns in keys to promote higher-cardinality attributes;
using a universally unique identifier (UUID).
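
The first of these, hashing a natural key, can be sketched in plain Python (illustrative only, not Spanner API code; the key format is hypothetical):

```python
import hashlib

def spread_key(natural_key: str) -> str:
    """Prefix a natural key with a short hash so that consecutive IDs
    land in different parts of the key space instead of all writes
    hitting the same split."""
    prefix = hashlib.sha256(natural_key.encode()).hexdigest()[:4]
    return f"{prefix}_{natural_key}"

keys = [spread_key(f"order-{i}") for i in range(3)]
print(keys)  # three keys whose hash prefixes scatter the inserts
```

The hash is deterministic, so point lookups still work: recompute `spread_key("order-1")` to find the row.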

86
Q

What’s the differences between primary and secondary indexes?

A

Primary indexes are created automatically on the primary key. Secondary indexes are useful when filtering in a query using a WHERE clause.

87
Q

What’s the structure of BigQuery?

A

Projects are the high-level structure used to organize the use of GCP services and resources. Datasets exist within a project and are containers for tables and views.

88
Q

How to denormalize data in BigQuery using nested and repeated fields?

A

Denormalizing in BigQuery can be done with nested and repeated columns. A column that contains nested and repeated data is defined as a RECORD datatype and is accessed as a STRUCT in SQL. BigQuery supports up to 15 levels of nested STRUCTs.

89
Q

When and why to use partitioning and clustering in BigQuery?

A

Partitioning is the process of dividing tables into segments called partitions.

BigQuery has three partition types: ingestion-time partitioned tables, timestamp partitioned tables, and integer range partitioned tables. Clustering orders the data within a partition by the values of the clustering columns, which improves the performance of queries that filter or aggregate on those columns.

90
Q

What are the different kinds of queries of BigQuery?

A

BigQuery supports two types of queries: interactive and batch queries.

91
Q

What is the difference between interactive queries and batch queries?

A

Interactive queries are executed immediately, whereas batch queries are queued and run when resources are available

92
Q

True or False? BigQuery can access external data without you having to import it into BigQuery first

A

True. BigQuery can access data in external sources, known as federated sources. External sources can be Cloud Bigtable, Cloud Storage, and Google Drive.

93
Q

True or False. BigQuery ML supports machine learning in BigQuery using SQL.

A

True. BigQuery extends standard SQL with the addition of machine learning functionality.

94
Q

What is Data Catalog?

A

Data Catalog is a metadata service for data management. Its primary function is to provide a single, consolidated view of enterprise data.

95
Q

Where does Data Catalog collect metadata automatically from?

A

Cloud Storage, Cloud Bigtable, Google Sheets, BigQuery, and Cloud Pub/Sub.

96
Q

What is Cloud Dataprep?

A

Cloud Dataprep is an interactive tool for preparing data for analysis and machine learning

97
Q

What is Cloud Dataprep used for?

A

Cloud Dataprep is used to cleanse, enrich, import, export, discover, structure, and validate data. The main cleansing operations in Cloud Dataprep center around altering column names, reformatting strings, and working with numeric values.

98
Q

What is the Data Studio tool?

A

The Data Studio tool is organized around reports, and it reads data from data sources and formats the data into tables and charts.

99
Q

What is Cloud Datalab?

A

Cloud Datalab is an interactive tool for exploring and transforming data.

100
Q

What is Cloud Composer?

A

Cloud Composer is a fully managed workflow orchestration service based on Apache Airflow

101
Q

What are the stages of ML pipelines?

A

Data ingestion, data preparation, data segregation, model training, model evaluation, model deployment, and model monitoring are the stages of ML pipelines.

102
Q

Explain Batch and Streaming ingestion.

A

Batch data ingestion should use a dedicated process for ingesting each distinct data source. Batch ingestion often occurs on a relatively fixed schedule, much like many data warehouse ETL processes.

Cloud Pub/Sub is a good option for ingesting streaming data.

103
Q

What are the three kinds of data preparation?

A

The three kinds of data preparation are data exploration, data transformation, and feature engineering.

104
Q

What is data segregation?

A

Data segregation is the process of splitting a dataset into three segments: training, validation, and test data.
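
A minimal sketch of the split; the 80/10/10 proportions are illustrative, not prescribed:

```python
import random

# Shuffle before splitting so each segment is representative;
# a fixed seed makes the split reproducible.
data = list(range(100))
random.Random(42).shuffle(data)

train, validation, test = data[:80], data[80:90], data[90:]
print(len(train), len(validation), len(test))  # 80 10 10
```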

105
Q

What are the different data types - training data, validation data and test data?

A

Training data is used to build machine learning models.

Validation data is used during hyperparameter tuning. Test data is used to evaluate model performance.

106
Q

What is feature selection?

A

Feature selection is the process of evaluating how a particular attribute or feature contributes to the predictiveness of a model.

107
Q

What is underfitting, overfitting, and regularization?

A

Underfitting - the model is too simple and fails to capture the patterns in the training data.
Overfitting - the model matches the training data too closely and generalizes poorly to new data.
Regularization - penalizing model complexity during training to discourage overfitting, producing a simpler model.
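
The regularization idea (here L2, with illustrative weights and lambda) can be sketched as:

```python
# L2 regularization adds a penalty to the loss that grows with the
# squared size of the weights, so large weights are discouraged.
weights = [3.0, -2.0, 0.5]
lam = 0.1  # regularization strength (hyperparameter)

l2_penalty = lam * sum(w * w for w in weights)
print(l2_penalty)  # 0.1 * (9 + 4 + 0.25) ≈ 1.325

data_loss = 0.8  # made-up loss on the training data
regularized_loss = data_loss + l2_penalty
```

Minimizing `regularized_loss` instead of `data_loss` trades a little training-set fit for smaller weights, which tends to generalize better.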

108
Q

How do we evaluate a model?

A

Methods for evaluating a model include individual evaluation metrics, such as accuracy, precision, recall, and the F measure; k-fold cross-validation; confusion matrices; and bias and variance.
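
The individual metrics can be computed directly from a binary confusion matrix; the counts below are made up for illustration:

```python
# Counts from a hypothetical binary confusion matrix:
# tp = true positives, fp = false positives,
# fn = false negatives, tn = true negatives.
tp, fp, fn, tn = 8, 2, 4, 86

accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall correctness
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, round(recall, 3), round(f1, 3))
# 0.94 0.8 0.667 0.727
```

Note that accuracy alone can mislead on imbalanced data like this (90% negatives), which is why precision and recall are reported alongside it.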

109
Q

What is Bias and Variance?

A

Bias is the difference between a model’s average prediction and the correct value. Variance is the variability of the model’s predictions.

110
Q

What are some ML deployment products that GCP has?

A

Cloud AutoML, BigQuery ML, Kubeflow, and Spark MLib

111
Q

Are single machines useful for training small models?

A

Yes. Single machines, for example running Jupyter Notebooks, are useful for training small models.

112
Q

How can you use TensorFlow for distributed training?

A

TensorFlow supports distributed training: distributing model training over a group of servers provides scalability and improved availability.

113
Q

What are some considerations for serving models?

A

When serving models, you need to consider latency, scalability, and version management.

114
Q

What is edge computing?

A

Edge computing is the practice of moving compute and storage resources closer to the location at which they are needed.

115
Q

What kinds of data do edge devices provide?

A

Edge devices provide three kinds of data: metadata about the device, state information about the device, and telemetry data.

116
Q

What are the differences between GPUs and TPUs?

A

Both are used to accelerate deep learning training, but TPUs do not suffer from the von Neumann bottleneck.

117
Q

What are supervised learning algorithms?

A

Supervised algorithms learn from labeled examples.

118
Q

What is unsupervised learning?

A

Unsupervised learning starts with unlabeled data and identifies salient features, such as groups or clusters, and anomalies in a data stream.

119
Q

What is reinforcement learning?

A

Reinforcement learning is a third type of machine learning algorithm that is distinct from supervised and unsupervised learning. It trains a model by interacting with its environment and receiving feedback on the decisions that it makes.

120
Q

What is classification (a supervised algorithm)?

A

Classification models assign discrete values to instances.

121
Q

What is regression (a supervised algorithm)?

A

Regression models map continuous variables to other continuous variables.

122
Q

What are some unsupervised learning algorithms?

A

Unsupervised learning algorithms find patterns in data without using predefined labels. Three types of unsupervised learning are clustering, anomaly detection, and collaborative filtering.

123
Q

What is the structure for a neural network?

A

The network is composed of artificial neurons that are linked together into a network. The links between artificial neurons are called connections. A single neuron is limited in what it can learn. A multilayer network, however, is able to learn more functions. A multilayer neural network consists of a set of input nodes, hidden nodes, and an output layer.

124
Q

Define the following ML training terms: baseline, batches, feature engineering, bucketing, gradient descent, backpropagation, neural network, activation function, dropout, precision, and recall.

A

Baseline - a simple reference model against which other models are compared.
Batches - subsets of the training data processed together in a single update step.
Feature engineering - creating new input features from raw data.
Bucketing - converting a continuous feature into discrete ranges (buckets).
Gradient descent - iteratively adjusting model parameters in the direction that reduces the loss.
Backpropagation - computing the gradient of the loss with respect to each weight by applying the chain rule backward through the network.
Neural network - a set of artificial neurons linked together by weighted connections.
Activation function - a nonlinear function applied to a neuron's output, allowing the network to learn nonlinear relationships.
Dropout - randomly ignoring a fraction of neurons during training to reduce overfitting.
Precision - the fraction of positive predictions that are actually positive.
Recall - the fraction of actual positives that the model predicts as positive.

125
Q

What is the issue with poor data?

A

Poor data produces poor models.

126
Q

What are some common data-quality problems?

A

Some common data-quality problems are missing data, invalid values, inconsistent use of codes and categories, and data that is not representative of the population at large.

127
Q

What is the Vision AI API?

A

Analyzes images: it can detect text (OCR), labels, faces, landmarks, and explicit content.

128
Q

What is the Video Intelligence API?

A

Video Intelligence API can extract metadata; identify key persons, places, and things; and annotate video content.

129
Q

What does DialogFlow do?

A

Dialogflow manages chatbots.

130
Q

What is Cloud Text-to-Speech API?

A

Converts text into natural-sounding speech.

131
Q

What is Cloud Translation API?

A

An API that translates text between languages.

132
Q

What is Recommendations AI API?

A

Suggests products to customers based on their behavior on the website and that website’s product catalog.

133
Q

What is Cloud Inference API?

A

The Cloud Inference API provides real-time analysis of time-series data.