GCP Professional Data Engineer Cert Flashcards

(264 cards)

1
Q

Relational Databases

A

Data is organized into tables that have relationships between them

Google Cloud SQL: managed SQL instances that need little setup and offer a choice of database engines such as MySQL. It scales vertically (up to 64 cores). Connections are secured through the Cloud SQL Proxy, SSL/TLS, or private IPs. It also provides maintenance windows, automated backups, and point-in-time recovery.
Importing MySQL data: InnoDB mysqldump export/import, CSV import, or external replica promotion (requires binary log retention)
PostgreSQL instances are another option: they have automated maintenance and high availability, but some PostgreSQL features are unsupported
Importing PostgreSQL data: SQL dump export/import or CSV import

2
Q

Cloud Firestore

A
  1. Fully managed NoSQL document store: serverless and autoscaling
  2. Realtime database with mobile SDKs: Android and iOS client libraries, plus frameworks for popular programming languages
  3. Strong consistency and scalability through horizontal autoscaling
    A collection bundles multiple documents
    Subcollections nest under a document (for example, messages)
3
Q

Cloud Spanner

A
  1. Managed SQL-compliant database: SQL schemas and queries with ACID transactions
  2. Horizontally scalable: strong consistency across rows and regions, from 1 to thousands of nodes
  3. Highly available: automatic global replication, no planned downtime, and a 99.999% SLA

High Cost

4
Q

CAP Theorem

A

Consistency: every read sees the most recent write. Availability: the system is always available to answer queries. Partition tolerance: the system keeps working when failures cut off or lose parts of the cluster.
A distributed system can guarantee at most two of the three at once
Spanner is strongly consistent and highly available. When it has to choose, it picks consistency over availability, but Google's global private network keeps partitions rare enough for five 9s of availability.

5
Q

Cloud Spanner Architecture

A

An instance is an allocation of resources, defined by an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region has multiple zones, and the instance keeps a replica in each zone. With a node count of 1, each replica is powered by one virtual machine. Raising the node count adds machines to every replica for more computing power. The number of replicas stays the same while the number of machines changes, so a node is a slice of compute that spans the replicas in the different zones.

6
Q

Cloud Memorystore

A

In-memory database
1. Fully managed Redis instance: provisioning, replication, and failover are fully automated
2. Basic tier: efficient cache that can withstand a cold restart and a full data flush
3. Standard tier: adds cross-zone replication and automatic failover
Benefits: no need to provision your own VMs, scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an instance: choose Redis version 3.2 or 4, the service tier and region, and the memory capacity (1-300 GB, which determines network throughput), then add configuration parameters
Connecting to instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Functions (through a serverless VPC connector)
Export to RDB backup (beta): admin operations are not permitted during the export, latency may increase, and the RDB file is written to Cloud Storage
Import from RDB backup: overwrites all current instance data, and the instance is unavailable during the import
Use cases: a session cache (commonly for logins and shopping carts), a message queue that enables loosely coupled services, or advanced Pub/Sub messaging

7
Q

Comparing Storage Options

A
  1. Is the data structured or unstructured? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files. For unstructured data, choose Cloud Storage.
  2. Is the data going to be used for analytics? Low latency vs. warehouse. Low latency: petabyte scale, single-key rows, time series or IoT data. Choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries. Choose BigQuery.
  3. Is the data relational? Horizontal vs. vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency. It is expensive, so ask whether the client can afford it; most financial institutions would probably use this. Choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability. Choose Cloud SQL.
  4. Is the data non-relational? NoSQL document store vs. key/value. Document store: fully managed document database, strong consistency, mobile SDKs and offline data. Choose Cloud Firestore. Key/value: managed Redis instances that do what Redis does. Choose Cloud Memorystore.
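The four questions above read naturally as a decision procedure. The sketch below is one hypothetical way to encode it; the function name `choose_storage` and its boolean flags are illustrative assumptions, not part of any Google Cloud API.

```python
def choose_storage(structured, analytics=False, low_latency=False,
                   relational=False, horizontal_scaling=False,
                   document_store=False):
    """Walk the card's four questions in order and return a product name."""
    if not structured:
        # Question 1: unstructured blobs go to object storage.
        return "Cloud Storage"
    if analytics:
        # Question 2: low-latency analytics vs. warehouse.
        return "Cloud Bigtable" if low_latency else "BigQuery"
    if relational:
        # Question 3: horizontal vs. vertical scaling.
        return "Cloud Spanner" if horizontal_scaling else "Cloud SQL"
    # Question 4: document store vs. key/value.
    return "Cloud Firestore" if document_store else "Cloud Memorystore"


print(choose_storage(structured=False))
print(choose_storage(structured=True, analytics=True, low_latency=True))
print(choose_storage(structured=True, relational=True))
```

Real workloads rarely reduce to six booleans, so treat this only as a memory aid for the order in which the questions are asked.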
8
Q

Streaming

A

Continuous collection of data, near real time analytics, windows and micro batches

9
Q

Batch

A

Data gathered within a defined time window, large volumes of data, data from legacy systems

10
Q

NoSQL

A

Anything that is not SQL: key-value stores, JSON document stores, and tools such as MongoDB and Cassandra

11
Q

SQL

A

Row-based tabular data; relational, so tables connect to other tables through queries

12
Q

Online Analytical Processing (OLAP)

A

Low volume of long running queries
Aggregated historical data, for example purchasing analytics

13
Q

Online Transactional Processing (OLTP)

A

High volume of short transactions, high integrity, SQL
Modifies the database

14
Q

Defines Big Data

A
  1. Volume: Scale of information being handled by data processing systems
  2. Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
  3. Variety: The diversity of data sources, formats, and quality
15
Q

Map Reduce

A

A programming model built on Map and Reduce functions
Distributed implementation
Created at Google to process large datasets across clusters of machines

16
Q

Map Function

A

Written by the user; takes an input pair and produces a set of intermediate key/value pairs

17
Q

Reduce Function

A

Merges the intermediate values associated with the same intermediate key to form a smaller set of values
The model standardizes the framework: the implementation abstracts away the distributed computing work of parallelizing and executing (partitioning, scheduling, and fault tolerance)
Splits each job into small chunks
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
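
To make the Map and Reduce roles concrete, here is a minimal single-process word count. It only imitates the programming model: the names `map_fn`, `shuffle`, and `reduce_fn` are illustrative, and a real framework would run these phases in parallel across workers.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: merge all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```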

18
Q

Hadoop and HDFS

A

Named after a toy elephant and inspired by the Google File System. It originated in the Apache Nutch project and became its own subproject in 2006.

Modules:
Hadoop Common: the base module, which includes the startup scripts
Hadoop Distributed File System (HDFS): a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster
Hadoop YARN: handles resource management tasks such as job scheduling and monitoring for Hadoop jobs
Hadoop MapReduce: Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters

19
Q

HDFS Architecture (can help with Cloud Dataproc)

A

One server runs the Name Node, which holds the file system metadata
The other servers run Data Nodes, which store very large files across the cluster as a series of blocks
Servers are grouped into racks, and the rack layout is used to find the shortest network path possible
The client can make multiple requests to the Name Node and read data from multiple Data Nodes across racks
Data is replicated across servers for fault tolerance
The YARN architecture is similar: one server runs the Resource Manager and each worker runs a Node Manager. The client sends jobs to the Resource Manager; on each worker, the Node Manager process handles local resources, requests tasks from the master, and returns the results.
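
The idea of storing a file as a series of replicated blocks can be sketched as follows. The 128 MB block size and replication factor of 3 are the usual Hadoop defaults rather than values stated on this card, and the round-robin placement is a deliberate simplification of the real rack-aware policy; `place_blocks` is a hypothetical helper name.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS default block size (128 MB)
REPLICATION = 3                 # assumed HDFS default replication factor


def place_blocks(file_size, data_nodes,
                 block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [data nodes holding a replica]} for one file."""
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Round-robin placement; real HDFS also considers racks and load.
        placement[block] = [data_nodes[(block + r) % len(data_nodes)]
                            for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(300 * 1024 * 1024, nodes)
print(len(layout))   # 3 blocks for a 300 MB file
print(layout[0])     # ['dn1', 'dn2', 'dn3']
```

In this toy model the dictionary plays the part of the Name Node metadata: it records where each block lives, while the Data Nodes would hold the bytes.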

20
Q

Apache Pig: a high-level framework for running MapReduce jobs on Hadoop clusters

A

Platform for analyzing large datasets
Pig Latin defines analytics jobs (merging, filtering, and transformation); it is high level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
It is an abstraction over MapReduce
Pig compiles our instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster

21
Q

Apache Spark

A

MapReduce's linear flow of data was an issue: read the data, map across it, reduce the results, and write to disk

Apache Spark is a general-purpose cluster-computing framework that allows concurrent computational jobs to be run across massive datasets
It uses resilient distributed datasets (RDDs), multisets of data that serve as a working set in a form of distributed shared memory

22
Q

Spark Modules

A

Spark SQL: structured data in Spark stored in an abstraction (DataFrames), with programmatic querying through the DataFrames API
Spark Streaming: streaming data ingestion in addition to batch processing, handled as very small batches
MLlib: machine learning library with algorithms for classification, regression, and decision trees
GraphX: iterative graph computation
Supported languages: Python, Java, Scala, R, SQL
Spark must have two things: a cluster manager (YARN or Kubernetes) and a distributed storage system (HDFS, Apache HBase, or Cassandra)

23
Q

Hadoop vs Spark

A

Hadoop: slow disk storage, high latency, slow but reliable batch processing
Spark: fast in-memory storage, low latency, stream processing, up to 100x faster in memory and 10x faster on disk, more expensive

24
Q

Apache Kafka

A

Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low latency: ingests millions of events from devices
Ex: handling more than 800 billion messages a day at LinkedIn
Four main APIs in Kafka:
Producer: allows an app to stream records to a Kafka topic
Consumer: allows an app to subscribe to one or more topics and process the stream of records contained within
Streams: allows an application to be a stream processor itself, transforming data and then sending it back to Kafka
Connector: allows reusable producers and consumers to link Kafka topics to existing applications or data systems

25
Kafka vs Pub/Sub
Kafka: guaranteed message ordering, tunable message retention, polling (pull) subscriptions only, unmanaged
Pub/Sub: no message ordering guaranteed, 7-day maximum message retention, pull or push subscriptions, managed
26
Pub/Sub Intro
A message bus takes care of all messages between devices. Pub/Sub splits it into topics: anything can publish a message to a topic or choose to receive messages from a topic.
Information from users and apps is published to a topic. Because topics are backed by the message bus, this adds resilience: Pub/Sub acts as a shock absorber.
Cloud Pub/Sub: global messaging and event ingestion, serverless and fully managed, 500 million messages per second, 1 TB/s of data
Great features: multiple publisher/subscriber patterns, at-least-once delivery, real-time or batch, integrates with Cloud Dataflow
27
Use Case: Distributing Workloads with Pub/Sub
Queue up a large number of tasks in a Pub/Sub topic and distribute them among multiple workers, such as Compute Engine instances
28
Asynchronous Workflows Pub/Sub
Controls the order of events: an order can be published to a topic, consumed by one worker system such as invoicing, and then passed into a queue for the next system to consume, such as packaging and posting
29
Distributing Event Notifications Pub/Sub
A system sets up new users when they register with your service: the registration publishes a message, and the system is notified to set the user up.
Distributed logging: logs can be sent to a Pub/Sub topic and consumed by multiple subscribers, such as a monitoring system and an analytics database for later querying.
30
Device Data Streaming Pub/Sub
Hundreds of thousands (or more) of internet-connected devices can stream their data into Pub/Sub topics, where it can be consumed on demand by your analytics streams or transformed through Dataflow first
31
One-to-One Pub/Sub
There is a publisher, a topic, and a subscriber. The publisher sends messages to the topic in Pub/Sub. The subscriber receives and reads the messages through its own subscription.
32
Many to many Pub/Sub
Just like the one-to-one pattern, but with multiple publishers, topics, and subscribers
33
Publishing Messages
Create a message containing your data: a base64-encoded JSON payload of 10 MB or less. Then send the payload as a request to the Pub/Sub API, specifying the topic the message should be published to.
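As a rough illustration of the steps above, the sketch below builds the JSON body of a REST publish request without sending it. The helper name `build_publish_body` is hypothetical, and treating 10 MB as 10 * 1024 * 1024 bytes is an assumption; the `messages` / `data` / `attributes` layout follows the Pub/Sub REST publish request as I understand it, so check it against the API reference before relying on it.

```python
import base64
import json

MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # assumed reading of the 10 MB limit


def build_publish_body(payload, attributes=None):
    """Return the JSON request body for publishing one message."""
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) > MAX_MESSAGE_BYTES:
        raise ValueError("payload exceeds the 10 MB message limit")
    # The message data travels base64-encoded inside the JSON body.
    message = {"data": base64.b64encode(raw).decode("ascii")}
    if attributes:
        message["attributes"] = attributes
    return json.dumps({"messages": [message]})


body = build_publish_body({"sensor": "s1", "temp": 21.5})
print(body)
```

In practice the official client libraries handle the encoding for you; the manual version is only useful for seeing what goes over the wire.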
34
Receiving Messages
Create a subscription to a topic; a subscription is always associated with a single topic. Pull is the default delivery method: you make ad hoc pull requests to the Pub/Sub API, specifying your subscription. Acknowledge each message you receive, or Pub/Sub will deliver it again. Push delivery sends messages to an endpoint, which must be HTTPS with a valid SSL certificate and must accept POST requests.
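The pull-and-acknowledge cycle can be pictured with a toy in-memory subscription. This is not the Pub/Sub client library: the `Subscription` class and its methods are hypothetical, and acknowledgement deadlines are ignored so that any unacknowledged message simply shows up again on the next pull.

```python
class Subscription:
    """Toy subscription: messages stay outstanding until acknowledged."""

    def __init__(self):
        self._pending = {}  # message id -> data, in arrival order
        self._next_id = 0

    def deliver(self, data):
        """Called when the topic forwards a published message."""
        self._pending[self._next_id] = data
        self._next_id += 1

    def pull(self, max_messages=10):
        """Return up to max_messages unacknowledged (id, data) pairs."""
        return list(self._pending.items())[:max_messages]

    def ack(self, message_id):
        """Acknowledge a message so it is not delivered again."""
        self._pending.pop(message_id, None)


sub = Subscription()
sub.deliver("order-1")
sub.deliver("order-2")
first_batch = sub.pull()
sub.ack(first_batch[0][0])   # acknowledge only the first message
print(sub.pull())            # [(1, 'order-2')]
```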
35
Integrations Pub/Sub
Client libraries exist for popular languages such as Python, C#, Go, Java, Node, PHP, and Ruby. Cloud Dataflow is supported: you can use the Apache Beam SDK to read messages as a stream or in batches. Cloud Functions and Cloud Run are also supported. Pub/Sub is the foundation of Cloud IoT Core, which sends and receives messages from connected devices.
Developing for Pub/Sub: the local Pub/Sub emulator requires the Google Cloud SDK and Java Runtime Environment 7+
36
Advanced Pub/Sub Topics
At-least-once delivery: each message is delivered at least once for every subscription
Undelivered messages: deleted after the message retention duration, which defaults to 7 days and can't be longer
Messages published before a subscription is created will not be delivered to that subscription
Subscriptions expire after 31 days of inactivity; a new subscription with the same name has no relationship to the previous one
37
Other Features
Seeking: set retain acked messages to true so the subscription retains acknowledged messages (by default for a maximum of 7 days). You can then tell the subscription to seek to a specific point in time, which rewinds the clock so past messages are received again. You can also seek to a future timestamp.
Snapshots: useful when deploying new code. Take a snapshot ahead of time to save the current state of the subscription, including its unacknowledged messages and future messages.
Ordering messages: messages may not arrive in the right order. Use timestamps when final order matters, or consider an alternative for transactional ordering, such as a SQL query.
Resource locations: messages are stored in the nearest region; message storage policies let you control this, and additional egress fees may apply.
38
Access Control Pub/Sub
Use service accounts for authorization, grant per-topic or per-subscription permission, grant limited access to publish or consume messages
39
Exam Tips Pub/Sub
Think about where you can decouple data: Pub/Sub is a shock absorber that receives data globally and lets other components consume it at their own pace
Think about where you can use Pub/Sub for events: it can add event logic to a stack and pass events from one system to another
Be aware of Pub/Sub limitations: message data must be 10 MB or less; beware of expired messages and unused subscriptions
Look for Apache Kafka in use cases; when it comes up, Pub/Sub can be a good option
Keep an eye out for Cloud IoT as a solution
Get familiar with Google Cloud Tasks
Browse the reference architectures, such as the Smart Analytics references
40
What is Dataflow
Fully managed, serverless tool that uses the open source Apache Beam SDK. Supports expressive SQL, Java, and Python APIs, real-time and batch processing, and stack integration.
Beam's unified development model lets us reuse code across streaming and batch pipelines.
Sources: Cloud Pub/Sub, BigQuery, and Cloud Storage; sources can also be external to GCP, such as Kafka
Common sinks: Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data
41
Dataflow Process
You have a pipeline, a source, and a sink. The pipeline takes data from the source, processes it, and places it into the sink. Apache Beam connectors let you connect to the source and the sink, so you can read your input and write your output data.
42
Common Dataflow Sources
Cloud Pub/Sub, BigQuery, and Cloud Storage; sources can also be external to GCP, such as Kafka
43
Common Dataflow Sinks
Cloud Storage, BigQuery, and Bigtable; Cloud Machine Learning can be applied to sink data
44
Driver
The program you write using the Apache Beam SDK (Java or Python). It defines your pipeline.
Pipeline: the full set of transformations that your data undergoes, from initial ingestion to final output
The driver program is submitted to the runner
45
Runner
Software that manages the execution of your pipeline and acts as a translator for the back-end execution framework. It can also manage local execution of driver programs for testing and debugging.
46
PCollections
Used in pipelines to represent data as it is transformed; a PCollection represents a multi-element dataset and can hold both batch and streaming data.
Data from a fixed source: the dataset is bounded and treated like a batch. Data from a continuously updating source: the dataset is unbounded (a stream).
A PCollection is usually created by reading from an external source.
Transform: represents a step in your pipeline. Each transform takes one or more PCollections as input and generates zero or more output PCollections.
47
Pipeline Development Cycle
1. Design your pipeline first: input and output methods, structure, and transformations
2. Create it: instantiate a pipeline object and implement the transformations that were identified
3. Test it: debugging a failed pipeline execution on a remote system is hard, so try to do local unit testing
48
Considerations
1. Start with the location of your data 2. Input data structure and format 3. Transformation objectives 4. Output data structure and location
49
Pipelines Structures
Basic: linear, from input to output
Branching: several transforms are applied to a single PCollection, which results in two different PCollections. Branching can also happen within a transform that produces multiple outputs.
Merging: pipeline branches can be merged again through a Flatten or Join transform
Multiple sources: a pipeline can read from several sources and transform them independently
50
DAG
Dataflow pipelines represent a directed acyclic graph (DAG): a graph with a finite number of vertices and edges and no directed cycles
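One way to see why the "no directed cycles" condition matters: only an acyclic graph of steps can be put into an execution order. The sketch below uses Kahn's algorithm for that check; the step names and the helper `topological_order` are made up for illustration and have nothing to do with the Dataflow API.

```python
from collections import deque


def topological_order(edges):
    """Return the steps in a valid order, or raise ValueError on a cycle."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    downstream = {n: [] for n in nodes}
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1
    # Start with the steps that have no upstream dependency (the sources).
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in downstream[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order


pipeline = [("read", "parse"), ("parse", "filter"), ("parse", "count"),
            ("filter", "write"), ("count", "write")]
print(topological_order(pipeline))
# ['read', 'parse', 'filter', 'count', 'write']
```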
51
Pipeline Creation
1. Create a Pipeline object
2. Create a PCollection using a read or create transform
3. Apply multiple transforms as required
4. Write out the final PCollection
5. Execute the pipeline using the pipeline runner
52
ParDo
Generic parallel processing transform: takes each element of an input PCollection and emits zero, one, or multiple elements to an output PCollection
53
User-defined function(UDF)
user written code that describes the operation to apply to each element of the input PCollection
54
Aggregation Transformation
The process of computing a single value from multiple input elements; with windowing, this is done for all elements that fall into a single window
55
Characteristics of PCollections
1. Any data type, but all elements must be of the same type
2. No support for random access
3. Immutable (unchanging)
4. Boundedness: a PCollection is either bounded (a finite number of elements) or unbounded (no upper limit on the number of elements)
5. A timestamp is associated with every element, initially assigned by the source that creates the PCollection
56
Core Beam Transforms
1. ParDo: generic parallel processing transform
2. GroupByKey: processes a collection of key/value pairs and collects all values associated with each unique key
3. CoGroupByKey: used when combining multiple PCollections; performs a relational join of two or more key/value PCollections that have the same key type
4. Combine: requires you to provide a function that defines the logic for combining elements; the function has to be associative and commutative (sum, min, max)
5. Flatten: merges multiple input PCollections into a single logical PCollection
6. Partition: you provide the logic that determines how the elements of a PCollection are split up
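To get a feel for what these transforms do to data, here is a plain-Python imitation that uses lists in place of PCollections. It is not the Apache Beam API: the function names are illustrative stand-ins, CoGroupByKey is left out for brevity, and nothing here runs in parallel.

```python
from collections import defaultdict


def par_do(pcoll, fn):
    """ParDo: fn returns zero, one, or many outputs per input element."""
    return [out for element in pcoll for out in fn(element)]


def group_by_key(pcoll):
    """GroupByKey: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pcoll:
        grouped[key].append(value)
    return sorted(grouped.items())


def combine_per_key(pcoll, combine_fn):
    """Combine: reduce each key's values with an associative, commutative fn."""
    return [(key, combine_fn(values)) for key, values in group_by_key(pcoll)]


def flatten(*pcolls):
    """Flatten: merge several collections into one."""
    return [element for pcoll in pcolls for element in pcoll]


def partition(pcoll, n, partition_fn):
    """Partition: split one collection into n by a user-supplied function."""
    parts = [[] for _ in range(n)]
    for element in pcoll:
        parts[partition_fn(element)].append(element)
    return parts


lines = ["a b", "", "b c b"]
words = par_do(lines, lambda line: [(w, 1) for w in line.split()])
counts = combine_per_key(words, sum)
print(counts)  # [('a', 1), ('b', 3), ('c', 1)]
frequent, rare = partition(counts, 2, lambda kv: 0 if kv[1] > 1 else 1)
print(frequent)  # [('b', 3)]
```

Note how the empty line produces no output elements at all, which is the "zero outputs" case of ParDo.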
57
Event time Dataflow
Event time is when a data element occurred, determined by the timestamp on the element itself. Processing time refers to the different times at which the element is processed as it moves through your pipeline.
58
Windowing
Windowing is applied to a PCollection and subdivides its elements according to their timestamps. It groups elements into finite windows, which allows grouping or aggregating operations over unbounded collections.
59
Fixed Window
The simplest type: constant, non-overlapping time intervals
60
Sliding Window
Represents time intervals that can overlap, so an element can belong to more than one window. Useful for taking running averages of data.
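The difference between fixed and sliding windows comes down to how many windows a timestamp lands in. A minimal sketch, assuming integer timestamps in seconds and windows written as half-open `(start, end)` intervals; the function names are hypothetical and this is not Beam's windowing API.

```python
def fixed_window(timestamp, size):
    """Return the single fixed window (start, end) containing timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every sliding window (start, end) that contains timestamp."""
    # Latest window start that is not after the timestamp.
    start = timestamp - timestamp % period
    windows = []
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


print(fixed_window(65, size=60))                 # (60, 120)
print(sliding_windows(65, size=60, period=30))   # [(30, 90), (60, 120)]
```

With a 60-second window that slides every 30 seconds, each element belongs to two windows, which is what makes running averages possible.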
61
Per Session
A new session window is created in a stream when the flow of events is interrupted for longer than a certain time period. Sessions apply on a per-key basis and are useful for data that is irregularly distributed with respect to time.
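Session windows can be sketched by splitting one key's timestamps wherever the gap between consecutive events exceeds a threshold. The helper `session_windows` and the sample numbers are made up for illustration, and the rule that a gap exactly equal to the threshold stays in the same session is a choice made here, following the card's wording.

```python
def session_windows(timestamps, gap):
    """Group one key's event timestamps into sessions split by gaps > gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            # Close enough to the previous event: same session.
            sessions[-1].append(ts)
        else:
            # The interruption exceeded the gap: start a new session.
            sessions.append([ts])
    return sessions


print(session_windows([1, 3, 4, 20, 22, 50], gap=5))
# [[1, 3, 4], [20, 22], [50]]
```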
62
Single global
Everything else: a single window that covers the whole PCollection when no other window transform applies
63
Watermark
The system's notion of when all the data for a certain window can be expected to have arrived. Late data: once the watermark moves past the end of a window, any further elements that arrive with a timestamp within that window are considered late.
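The late-data rule above fits in a few lines. In this sketch windows are half-open `(start, end)` intervals and the watermark is a plain number; `is_late` is a hypothetical helper, not something Beam or Dataflow exposes.

```python
def is_late(event_time, window, watermark):
    """Return True if the element arrives after the watermark passed its window."""
    start, end = window
    if not start <= event_time < end:
        raise ValueError("event_time does not fall inside the window")
    return watermark >= end


window = (0, 60)
print(is_late(42, window, watermark=50))  # False: the window is still open
print(is_late(42, window, watermark=75))  # True: the watermark passed the end
```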
64
Triggers
1. Event time
2. Processing time
3. Data driven: fires when the data in a particular window meets a certain criterion
4. Composite: combines other triggers in different ways
65
Pipeline Access
Running Cloud Dataflow pipelines: 1. Run locally 2. Submit the pipeline to the GCP Dataflow managed service
GCP service accounts: 1. The Cloud Dataflow service uses the Dataflow service account 2. Worker instances use the controller service account
66
Cloud Dataflow Managed Service
1. The pipeline is submitted to the GCP Dataflow service
2. The Dataflow service creates a job
3. The job creates managers and workers to carry out various tasks
4. During execution, the workers need files and resources from Cloud Storage
5. The job can be monitored with the Cloud Dataflow monitoring interface or the Cloud Dataflow command-line interface
67
Cloud Dataflow Service Account
1. Automatically created when Cloud Dataflow project is created 2. Manipulates job resources 3. Assumes the Cloud Dataflow service agent role 4. Has Read/Write Access to project resources
68
Controller Service Account (used by the workers, which run on Compute Engine)
1. Compute Engine instances execute pipeline operations
2. Also used for metadata operations, which don't run on the local client or on Compute Engine workers, such as determining the size of a file in Cloud Storage
3. A user-managed controller service account can be used for resources with fine-grained access control
69
Security Mechanisms
1. Submission of the pipeline: users must have the right permissions
2. Evaluation of the pipeline: data is encrypted and not persisted beyond the evaluation; workers communicate over a private network, subject to the project's permissions and firewalls; you can specify the region and zone
3. Accessing telemetry or metrics: encrypted at rest and controlled by the project's permissions
4. You can also use Cloud Dataflow IAM roles
70
Regional Endpoints in Dataflow
1. Manages metadata about Cloud Dataflow jobs 2. Controls Cloud Dataflow workers 3. Automatically selects the best zone
Good reasons to use regional endpoints: 1. Security and compliance 2. Data locality 3. Resiliency
71
Machine Learning with Cloud Dataflow
1. Data extraction: Cloud Dataflow reads the data from Cloud Storage
2. Data preprocessing runs in an Apache Beam pipeline on Cloud Dataflow: a TensorFlow API is used to normalize some values between 0 and 1, and the Beam Partition transform splits the dataset into a training set and an evaluation set
3. TensorFlow trains the model, either locally on your machine or through Cloud Machine Learning; this step doesn't use Cloud Dataflow
4. Predictions: Cloud Dataflow reads from Pub/Sub and writes the predictions to another Pub/Sub topic
72
Benefits of Dataflow
You can use customer-managed encryption keys
Batch pipelines can be processed cost-effectively with Flexible Resource Scheduling (FlexRS), which uses advanced scheduling, the Cloud Dataflow Shuffle service, and preemptible VMs
Cloud Dataflow is a good target for migrating MapReduce jobs: on-premises MapReduce jobs can be rebuilt on Cloud Dataflow
Cloud Dataflow with Pub/Sub Seek: replay and reprocess previously acknowledged messages, especially in bulk
73
Cloud Dataflow SQL
1. Develop and run Cloud Dataflow jobs from the BigQuery web UI
2. Cloud Dataflow SQL (a ZetaSQL variant) integrates with Apache Beam SQL
Apache Beam SQL: query bounded and unbounded PCollections; the query is converted to a SQL transform
Cloud Dataflow SQL: use existing SQL skills, join streams with BigQuery tables, query streams or static datasets, and write output to BigQuery for analysis and visualization
74
Dataflow Exam Tips
Beam and Dataflow are the preferred solution for streaming data
Pipeline: represents the complete set of stages required to read data, perform any transformations, and write data
PCollection: represents a multi-element dataset that is processed by the pipeline
ParDo: the core parallel processing function of Apache Beam, which transforms elements of an input PCollection into an output PCollection
DoFn: the template you use to create user-defined functions that are referenced by a ParDo
Sources: where data is read from
Sinks: where data is written to
Window: allows streaming data to be grouped into finite collections according to time-based or session-based windows
Watermark: indicates when Dataflow expects all data in a window to have arrived; data that arrives past the watermark is considered late
Dataflow is normally the preferred solution for data ingestion pipelines
Cloud Composer is sometimes used for ad hoc orchestration or to provide manual control of the Dataflow pipelines themselves
75
What is Dataproc?
A managed cluster service for Hadoop and Apache Spark. Managed is preferable because it is low cost and you control which clusters to grow and which to turn off.
76
Dataproc Architecture
Master: Dataproc creates a master node that runs the YARN Resource Manager and the HDFS Name Node. It also creates the worker nodes. Nodes come pre-installed with Hadoop, Apache Spark, ZooKeeper, Hive, Pig, Tez, and other tools such as Jupyter notebooks and the GCS connector. Storage and configuration are handled by Dataproc.
77
Dataproc benefits
1. Cluster actions complete in ~90 seconds 2. Per-second billing with a 1-minute minimum 3. Scale up or down, or turn off, at will
78
Using Dataproc
You can submit Hadoop/Spark jobs, enable autoscaling if necessary to cope with the load of the job, and output to GCP services such as Cloud Storage, BigQuery, and Bigtable. You can also monitor with Stackdriver, which provides fully integrated logging and monitoring for job performance and output.
79
Cluster Location
Regional: isolates the resources used for Dataproc in one region, such as us-east1 or europe-west1
Global: resources are not isolated to a single region, so the cluster can be placed in any zone worldwide
80
Single Node Cluster
A single VM that runs both the master and the worker processes; it can't autoscale
81
Standard Cluster
Has a master VM that runs the YARN Resource Manager and the HDFS Name Node, plus two worker nodes that each run a YARN Node Manager and an HDFS Data Node; the disks are customizable. Preemptible workers can also be added; they sometimes help with large jobs but can't provide storage for HDFS.
82
High Availability Cluster
Three masters, with YARN and HDFS configured to run in high availability mode so that there are no interruptions
83
Submitting Jobs
1. gcloud command line 2. GCP Console 3. Dataproc API 4. SSH to the master node
84
Monitoring and Logging
Use Stackdriver Monitoring to monitor cluster health through metrics such as:
cluster/yarn/allocated_memory_percentage
cluster/hdfs/storage_utilization
cluster/hdfs/unhealthy_blocks
85
Custom Clusters
You can customize the Dataproc default image: Google provides a script that applies a customization script you have written (for custom packages) on top of the default image, and the resulting image is stored in Google Cloud.
You can also have:
Custom cluster properties, so you can change configuration values
Initialization actions that are custom to the cluster: scripts loaded from a Cloud Storage bucket, mostly for staging binaries
Custom Java/Scala dependencies, which saves you from precompiling
86
Autoscaling in Dataproc
Huge bonus: you can create lightweight clusters and have them scale up automatically to meet the demands of the job. The autoscaling policy is written in YAML and holds the configuration numbers for primary and secondary workers.
87
When to not use Autoscaling
1. With HDFS 2. With Apache Spark Streaming 3. With idle clusters 4. With YARN node labels
88
Workflow Templates
Written in YAML; a template can specify multiple jobs with different configurations and parameters to be run in succession. Workflow templates have to be created and then instantiated with gcloud. Jobs can be sent to a new cluster each time or to an existing cluster.
89
Advanced Compute Features Dataproc
1. Local SSDs-faster runtimes 2. GPUs to nodes-for machine learning
90
Cloud Storage Connector
1. Use GCS instead of HDFS 2. Cheaper than persistent disk 3. High availability and durability 4. Decouple storage from cluster lifecycle
90
Exam Tips
Know when to choose Dataproc: quickly migrating Hadoop and Spark workloads into Google Cloud Platform
Understand the benefits of Dataproc over a self-managed Hadoop or Spark cluster: ease of scaling, the ability to use Cloud Storage instead of HDFS, and connectors to other GCP services such as BigQuery and Bigtable
Know the cluster options: when to pick standard vs. high availability, autoscaling, and ephemeral clusters
Get to know the open-source big data ecosystem: Hadoop, Spark, ZooKeeper, Hive, Tez, and Jupyter
Know when to choose Dataflow instead: it is sometimes the preferred product for big data ingestion, for example in streaming workloads, and it implements the Apache Beam SDK
91
Bigtable Concepts
Managed wide-column NoSQL database: a series of key/value pairs where the values are split into columns
Very high throughput (10,000 reads per second per node) and low latency (6 milliseconds)
Scales linearly
High availability out of the box through cross-cluster replication
Developed internally by Google and used for Google Earth, Finance, and web indexing
HBase was created as the open source implementation of the Bigtable model and was adopted as a top-level Apache project; Cloud Bigtable supports the Apache HBase library for Java
92
Cloud Bigtable
1. The row key is the only index
2. Each row key is attached to columns
3. Columns can be grouped into column families
4. Empty values take up no space, since it is a sparse database
5. Scales to thousands of columns and billions of rows
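A toy model can make the "row key is the only index" and "sparse" points tangible. The `SparseTable` class below is purely illustrative and unrelated to the Bigtable client library; the row keys and column names are invented sample data.

```python
import bisect


class SparseTable:
    """Toy wide-column table: rows sorted by key, only non-empty cells stored."""

    def __init__(self):
        self._keys = []  # sorted row keys, the only index
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        return self._rows.get(row_key, {})

    def scan(self, prefix):
        """Return rows whose key starts with prefix, using the sorted order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


table = SparseTable()
table.put("sensor1#20240101", "metrics", "temp", 21.5)
table.put("sensor2#20240101", "meta", "model", "x200")
table.put("sensor1#20240102", "metrics", "temp", 22.0)
print([key for key, _ in table.scan("sensor1#")])
# ['sensor1#20240101', 'sensor1#20240102']
```

Because rows are kept sorted by key, a prefix scan touches only a contiguous range, which is why row key design matters so much for query patterns. Cells that were never written simply do not exist in the inner dictionaries.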
93
Important Bigtable Features
1. Blocks of contiguous rows are sharded into tablets 2. Tablets are chunks of sorted rows; put together they form a complete table, and they are managed by the nodes in your cluster 3. Tablet data is stored in Google Colossus, so cluster sizes can scale 4. Splitting, merging, and rebalancing happen automatically
94
Bigtable Scenarios
Suited well for financial, marketing data and transactional data Also good for time series data and data from IoT devices Good for streaming data and machine learning applications
95
Bigtable Architecture
You create an instance. An instance has an instance type, a storage type, and app profiles, which describe parameters for incoming connections. To connect, you use an instance ID and an application profile. Inside the instance you have clusters, and inside the clusters you have nodes, which are the workhorses of Bigtable. The flexibility of data storage comes from separating the cluster nodes from the data, which is stored in Colossus. Nodes control tablets, and a tablet can’t be shared by more than one node
96
Instance Types
1. Production: 1+ clusters, 3+ nodes per cluster 2. Development: single-node cluster for development work. A development instance can’t use replication and doesn’t have an SLA, but it is a cheaper option
96
SSD Storage Type
Almost always the right choice. Fastest and most predictable option: 6 ms latency for 99% of reads and writes. Each node can process 2.5 TB of SSD data
97
HDD Storage Type
Each node can process 8 TB of HDD data. Throughput is limited, and row reads are about 5% of the speed of SSD reads. Only worth considering when storing at least 10 TB of infrequently accessed data with no latency sensitivity; even then you could end up spending more money on clusters
98
Application Profiles
1. Custom application-specific settings for handling incoming connections 2. Single-cluster or multi-cluster routing 3. Single-cluster routing: routes to a single cluster that you define, even if you have multiple clusters in an instance 4. Multi-cluster routing: routes to the nearest available cluster and, if it is unavailable, goes to the next cluster 5. Ask whether the data needs single-row transactions; if so, you have to use single-cluster routing
99
Bigtable Configuration
1. Instances can run up to four clusters 2. Clusters exist in a single zone 3. Up to 30 nodes per project 4. Maximum of 1,000 tables per instance
100
Bigtable Access Control
1. Cloud IAM roles 2. Applied at project or instance level to- 3. Restrict access or administration. 4. Restrict reads and writes 5. Restrict development instances or production access
101
Data Storage Model
Only the ROW KEY is indexed. Column families allow us to grab only what we need. Column names are called column qualifiers. You can write new data values and the old ones aren’t overwritten. How many versions are stored and for how long is configurable, with detailed granularity. Each value is an array of bytes
101
Alternative Options to Bigtable
1. Need SQL Support OLTP: Cloud SQL 2. Need Interactive Queries OLAP and cheaper: BigQuery 3. Need structured NoSQL Documents: Cloud Firestore 4. Need In-memory Key/Value Pairs: Memorystore 5. Need Realtime Database: Firebase
102
Important Bigtable Info
Rows are sorted alphabetically (lexicographically by row key), so the design of the row key is very important. Atomic operations are by row only, so be careful when updating. Sparse table system: it doesn’t hurt to have a lot of columns/families even if they don’t apply to every entity. Sizing: a single cell should be no larger than 10 MB, and a total row, not including the key, should be under 100 MB
103
Timestamps and Garbage Collection
1. Each cell has multiple versions 2. Versions are identified by server-recorded timestamps 3. Or by sequential numbers 4. Expiry policies define garbage collection: versions can expire based on a specific age or a specific number of versions 5. Creating a column family with the HBase client sets the policy to retain only the latest version of a cell; any other client library sets the column family to store infinite versions. cbt is an alternate way to connect to Bigtable
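A minimal sketch of how an expiry policy could decide which cell versions survive, assuming a policy that drops a version if it is too old or beyond the newest N. The names `CellVersion` and `collect_garbage` are made up for illustration and are not part of any Bigtable client library.

```python
from dataclasses import dataclass


@dataclass
class CellVersion:
    timestamp: float  # seconds since epoch (hypothetical unit for this sketch)
    value: bytes


def collect_garbage(versions, now, max_versions=None, max_age_seconds=None):
    """Return the versions that survive the policy, newest first."""
    survivors = sorted(versions, key=lambda v: v.timestamp, reverse=True)
    if max_age_seconds is not None:
        # Age-based expiry: drop anything older than the allowed age.
        survivors = [v for v in survivors if now - v.timestamp <= max_age_seconds]
    if max_versions is not None:
        # Version-based expiry: keep only the newest N versions.
        survivors = survivors[:max_versions]
    return survivors


cell = [CellVersion(100, b"a"), CellVersion(200, b"b"), CellVersion(300, b"c")]
latest_only = collect_garbage(cell, now=310, max_versions=1)
recent_only = collect_garbage(cell, now=310, max_age_seconds=150)
```

With `max_versions=1` the sketch behaves like the HBase-style default described above (only the latest version kept); with both arguments left as `None` it keeps every version.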
104
Bigtable Schema Design
Querying on a column value is the most time-expensive way to query Bigtable: you have to scan the entire table and then filter the results based on a regular expression match to the string contained in that column cell. You have to plan your queries ahead of time. Field promotion: taking data that you already know and moving it into the row key itself. Then you can write a command like: scan 'vehicles', {ROWPREFIXFILTER => 'NYMT#86#'} You can also include a timestamp in the ROW KEY design. Never put a timestamp at the front of the row key
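A rough sketch of field promotion in plain Python, with no Bigtable client involved. The field names (`agency`, `line`, `vehicle_id`) are assumptions chosen to match the 'NYMT#86#' prefix above, and `prefix_scan` only simulates what a row-prefix filter does over sorted keys.

```python
def make_row_key(agency, line, vehicle_id):
    """Promote fields we always query by into the row key itself."""
    return "#".join([agency, str(line), vehicle_id])


def prefix_scan(row_keys, prefix):
    """Simulate a row-prefix scan: rows are sorted, so matches are contiguous."""
    return [key for key in sorted(row_keys) if key.startswith(prefix)]


keys = [
    make_row_key("NYMT", 86, "bus-0001"),
    make_row_key("NYMT", 87, "bus-0002"),
    make_row_key("LAMT", 86, "bus-0003"),
]
matches = prefix_scan(keys, "NYMT#86#")
```

Because the agency and line live in the key, the lookup only touches rows under that prefix instead of filtering every cell in the table.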
104
Designing Row Keys
1. Queries use: A row key, a row prefix 2. A row range is returned 3. Reverse domain names 4. String identifiers-reads and writes evenly spread 5. Timestamps as only a part of a bigger row key design if it is not first and is reversed
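To make points 3 and 5 concrete, here is a small sketch that builds a key from a reversed domain name followed by a reversed timestamp. `MAX_TS` is an arbitrary upper bound picked for this example and the `#` separator is an assumption; neither comes from Bigtable itself.

```python
MAX_TS = 10**13  # assumed upper bound on millisecond timestamps for this sketch


def reverse_domain(domain):
    """maps.google.com -> com.google.maps, so related hosts sort together."""
    return ".".join(reversed(domain.split(".")))


def reversed_timestamp(ts_millis):
    """Newer timestamps become smaller numbers, so they sort first."""
    return MAX_TS - ts_millis


def make_key(domain, ts_millis):
    # The timestamp is only a suffix of the key, never the leading component.
    return f"{reverse_domain(domain)}#{reversed_timestamp(ts_millis):013d}"


older = make_key("maps.google.com", 1_600_000_000_000)
newer = make_key("maps.google.com", 1_600_000_001_000)
```

Sorting these keys lexicographically puts the newest event for a domain first, which suits "latest N readings" style queries.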
105
Row Keys to Avoid
1. Domain names in order 2. Sequential numbers 3. Frequently updated identifiers 4. Hashed values
106
Design for Performance
1. Lexicographic sorting 2. Store related entities in adjacent rows 3. Distribute reads and writes evenly 4. Balanced access patterns enable linear scaling of performance
107
Avoid Hotspots
1. Use Field Promotion instead 2. Try Salting-salted hash to your row key artificially distributes the rows, based on total number of nodes 3. Use Google’s Key Visualizer tool
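A hedged sketch of salting: a bucket number derived from a hash of the key is put in front of it so that sequential keys land in different parts of the key space. The function names are hypothetical, and choosing `num_buckets` to match the node count follows the note above rather than any official rule.

```python
import zlib


def salted_key(row_key, num_buckets):
    """Prefix the key with a deterministic bucket in [0, num_buckets)."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % num_buckets
    return f"{bucket:02d}#{row_key}"


def scan_prefixes(key_prefix, num_buckets):
    """Reads must fan out: one prefix scan per bucket."""
    return [f"{bucket:02d}#{key_prefix}" for bucket in range(num_buckets)]


key = salted_key("sensor42#20200101", num_buckets=4)
prefixes = scan_prefixes("sensor42#", num_buckets=4)
```

The trade-off is visible in `scan_prefixes`: writes are spread out, but a range read now needs one scan per bucket.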
108
Time Series Data in Bigtable
1. Use tall and narrow tables where each row might contain a key and maybe only a single column 2. Use rows instead of versioned cells 3. Logically separate tables 4. Don’t reinvent the wheel: there are already good time-series schemas out there, such as the OpenTSDB project
109
Monitoring Bigtable
1. Via the GCP Console or Stackdriver 2. Watch the average CPU utilization of the cluster and of the hottest node 3. Single-cluster instance: aim for an average CPU load of 70% and a hottest node not over 90% CPU 4. Instances with 2 clusters and replication: multi-cluster routing brings additional overhead, so the average CPU load should be 35% and the hottest node at most 45% 5. For storage utilization, try to keep it below 70% per node 6. To monitor per application, create an application profile for each application
109
Autoscaling Bigtable
1. Stackdriver metrics can be used for programmatic scaling-done on local computer 2. Client libraries query metrics 3. Update cluster node counts via API 4. Rebalancing tablets can take time and the performance might not improve for 20 mins 5. Adding nodes to a cluster doesn’t solve the problem of a bad schema
110
Replication Bigtable
1. Adding additional clusters automatically starts replication, i.e. data synchronization 2. Replication is eventually consistent 3. Used for availability and failover 4. Application isolation 5. Global presence
111
Good Performance in Bigtable
Replication improves read throughput but does not affect write throughput Use batch writes for bulk data with rows that are close together lexicographically Monitor instances and use the Key Visualizer to monitor hotspots and bad row keys Bigtable rebalances tablets-first they all go in the first node, but then they rebalance or spread out to the other growing nodes-the tablets are also being split, merged, and rebalanced to maintain the sorted order of rows Hotspots sometimes pop up and take a lot of CPU, but the other tablets in the node go to other nodes so the overwhelmed tablets are less overwhelmed
112
Good vs Bad Performance in Bigtable
Good Performance: Optimized schema and row key design, large datasets, correct row and column sizing Bad Performance: Datasets short lived or smaller than 300GB
113
Exam tips Bigtable
Know when to choose Bigtable: many questions make you choose the right product for the workload. When migrating from an on-premises environment, look for HBase, and consider when Bigtable is a better option than BigQuery. Look for time-series data or use cases where latency is an issue. Understand the architecture of Bigtable: the concepts of an instance and a cluster, where Bigtable stores data, and how tablets are rebalanced by the service between nodes. Be aware of causes of bad performance, like under-resourced clusters, bad schema design, and poorly chosen row keys. UNDERSTAND ROW KEYS: the linear scaling and performance of Bigtable depend on good row keys. Understand row key design; you might have to point out flaws or pick an ideal row key. Understand Tall vs Wide: a wide table stores multiple columns for a given row key, where the query pattern is likely to require all the information about a single entity. Tall tables suit time-series or graph data and often contain only a single column. Remember organizational design: consider when a development instance is appropriate, and remember the IAM roles that can be used to isolate access to the necessary groups.
114
What is BigQuery?
Petabyte-scale, serverless, highly scalable cloud enterprise data warehouse. In-memory BI Engine for fast interactive reports. Has machine learning capabilities (BigQuery ML) using SQL. Support for geospatial data storage and processing
115
Key Features of BQ
1. High availability 2. Supports SQL-can do SQL queries 3. Federated Data-can connect to and process data stored outside of BigQuery 4. Automatic Backups 5. Governance and Security support-data encrypted at rest and in transit 6. Separation of Storage and Compute-cost effective scalable storage and stateless resilient compute
115
Interacting with BQ
1. Web console 2. Command line tool (bq) 3. Client libraries like C#, Go, Java, Node.JS, PHP, Python, and Ruby
116
Managing Data with BQ
You have a Project and within each Project, you have a Dataset, and within each Dataset, you can have Native Tables, External Tables, or Views
117
Native Table
Data is held within a BigQuery Storage
118
External Tables
Backed by storage outside of BigQuery
119
Views
created by a SQL query
120
Real Time Events BQ
Streaming, common to push events to Cloud Pub/Sub, then use a Cloud Dataflow job to process and push them into BigQuery
121
Batch Sources BQ
Comes in a Bulk Load, common to push files to Cloud Storage, then have a cloud Dataflow job pick that data up, process it and then push it into BigQuery
122
Legacy SQL
1. Previously Called BigQuery SQL 2. Non-standard SQL dialect 3. Migration to standard SQL is recommended
123
Standard SQL
1. Preferred dialect 2. Compliant with SQL 2011 standard 3. Extensions for querying nested and repeated data
124
What you can do with BQ
With BigQuery Data you can: Use BI Tools, Use Cloud Datalab, Export to sheets or Cloud Storage, send it to Colleagues, or use it for GCP Big Data Tools like Dataflow or Dataproc
125
Jobs and Operations in BQ
Job: an action that is run in BigQuery on your behalf. Load Job: loads data into BQ. Export Job: exports from BQ. Query Job: queries the data in BQ; results are always saved to a temporary table or to a permanent table. Copy Job: copies tables and datasets in BQ. Query Job priorities: Interactive (default) and Batch
126
Table Storage in BQ
1. Capacitor columnar data format 2. Tables can be partitioned 3. Individual records exist as rows 4. Each record is composed of columns 5. Table schemas specified at the creation of the table or at a data load
127
Capacitor in BQ
1. The storage system: proprietary columnar data storage that supports semi-structured data (nested and repeated fields); imports from CSV or JSON are converted to Capacitor format 2. Each value is stored together with a repetition level and a definition level (value, repetition level, definition level)
128
Denormalization
1. BQ performance is best when data is denormalized 2. Use nested and repeated columns 3. They maintain data relationships in an efficient manner 4. RECORD (STRUCT) data type: nested records or columns. EX: Address = address.number, address.street, address.city; a single ID can have multiple addresses
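As a loose illustration of the address example, the sketch below models one denormalized row as a Python dict whose `address` field is a repeated record. The field values and the helper `cities` are invented for this example; in BigQuery itself this would be a repeated RECORD column.

```python
# One denormalized row: the repeated "address" record lives inside the row,
# so no join against a separate address table is needed.
customer = {
    "id": 1,
    "name": "Ada",
    "address": [
        {"number": 12, "street": "Main St", "city": "Leeds"},
        {"number": 7, "street": "High St", "city": "York"},
    ],
}


def cities(record):
    """Read address.city for every repeated address in the row."""
    return [addr["city"] for addr in record["address"]]


customer_cities = cities(customer)
```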
129
Data Formats in BQ
CSV, JSON (newline delimited), Avro (open-source data format where the schema is stored together with the data; compressed data), Parquet (encoded, smaller files), ORC (Hive data), Cloud Datastore exports, and Cloud Firestore exports
130
BQ Views
A virtual table defined by a SQL query. The SQL query (the view definition) runs against tables, and the resulting view lives in a dataset. You can query the view like a table; this has billing implications, since you are still running the underlying query
131
Uses of Views
1. Control access to data 2. Reduce query complexity 3. Construct logical tables 4. Ability to create authorized views-can connect to different subsets of rows from the view
132
Limitations of Views
1. Can’t export data since views are unmaterialized 2. Can’t use the JSON API to retrieve data from a view 3. Can’t combine standard and legacy SQL 4. No user-defined functions 5. No wildcard table references 6. Limited to 1,000 authorized views per dataset
133
External Data in BQ
You can query directly even though the data is not directly held in BQ BQ supports Cloud Bigtable, Cloud Storage, and Google Drive Use Cases for using External Data Source: Load and clean your data in one pass, or you have small, frequently changing data joined with other tables
134
Limitations for External Data Source in BQ
1. No guarantee of consistency 2. Lower query performance 3. Can’t use the TableDataList API method 4. Can’t run export jobs on external data 5. Can’t reference external data in a wildcard table query 6. Can query Parquet or ORC formats 7. Query results not cached 8. Limited to 4 concurrent queries
135
Other Data Sources
1. Public Datasets-available to everyone 2. Shared Datasets-Have been shared with you 3. Stackdriver log information
136
Data Transfer Service
Can easily pull bulk data into BQ with the Data Transfer Service. DTS has multiple connectors: Google sources, GCP, AWS services like S3 and Redshift, and other third-party services like LinkedIn and Facebook. Transfers can be one-off events or scheduled to run repeatedly. DTS supports loading historical data and comes with an uptime and delivery SLA
137
Table Partitioning BQ
Table partitioning-break up big table into smaller tables Partitions stored separately on physical level Partitions usually based on a single column called the partition key. Partitioning BQ 2 Ways 1. Ingestion Time partitioned tables 2. Partitioned Tables
138
Ingestion Time Partitioning
Partitioned by load or arrival date. Data is automatically loaded into date-based partitions (daily). Tables include the pseudo-column _PARTITIONTIME; use _PARTITIONTIME in queries to limit the partitions scanned
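A toy model of partition pruning in plain Python, to show why filtering on the partition column limits what is scanned. The dict of daily partitions and the function `query` are stand-ins invented for this sketch and do not use the BigQuery API.

```python
from datetime import date

# Hypothetical daily partitions: partition date -> rows loaded that day.
partitions = {
    date(2020, 1, 1): ["a", "b"],
    date(2020, 1, 2): ["c"],
    date(2020, 1, 3): ["d", "e", "f"],
}


def query(partitions, start, end):
    """Return matching rows and how many rows had to be scanned."""
    rows, scanned = [], 0
    for day, partition_rows in partitions.items():
        if start <= day <= end:  # the filter on the partition column
            rows.extend(partition_rows)
            scanned += len(partition_rows)
    return rows, scanned


rows, scanned = query(partitions, date(2020, 1, 2), date(2020, 1, 3))
```

Only the partitions inside the date range are touched, which is the reason a partition filter cuts both scan time and cost.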
139
Partitioned Tables in BQ
Partitioned based on a certain TIMESTAMP or DATE column. Data is partitioned based on the value supplied in the partitioning column. 2 additional partitions: __NULL__ and __UNPARTITIONED__. Use the partitioning column in queries. BQ automatically places data in the right partitions. You need to declare the table as partitioned when creating it
140
Clustering Tables BQ
Clustering can be done on a partitioned table. Use clustering when you have filters or aggregations against specific columns in your queries. When partitioning and clustering together, the table is partitioned by the partition key and then clustered based on the cluster key. In clustered tables, the data associated with a certain cluster key is generally stored together. Ordering of the clustering columns is important
141
Clustering Limitations
1. Supported only for partitioned tables 2. Standard SQL only for querying clustered tables 3. Standard SQL only for writing query results to clustered tables 4. Specify clustering columns only when the table is created 5. Clustering columns can’t be modified after table creation 6. Clustering columns have to be top-level, non-repeated columns 7. You can specify one to four clustering columns
142
Querying Guidelines for Clustering Tables
1. Filter clustered Columns in the order they were specified 2. Avoid using clustered columns in complex filter expressions 3. Avoid comparing cluster columns to other columns Why partition tables 1. Improve query performance 2. Control costs
143
Benefits of BQ Slots
Slots: units of computational capacity required to execute SQL queries; relevant for pricing and resource allocation. The number of slots a query uses is determined by query size and query complexity. BQ automatically manages your slots quota. Flat-rate pricing is available: purchase a fixed number of slots. You can see slot usage using Stackdriver
144
Cost Controls of BQ
1. Avoid using SELECT * 2. Use preview options to sample data 3. Price queries before executing them 4. Remember that using LIMIT doesn’t affect cost 5. View costs using a dashboard and query audit logs 6. Partition by date 7. Materialize query results in stages 8. Consider the cost of large result sets 9. Use streaming inserts with caution
145
Query Performance Dimensions
1. Input data and data sources 2. Shuffling 3. Query computation 4. Materialisation 5. SQL anti-patterns
145
Input data and data sources best practices BQ
Input data and data sources: Prune partitioned queries, denormalize data whenever possible, use external data sources appropriately, avoid excessive wildcard tables
146
Query Computation Best Practices
Avoid repeatedly transforming data via SQL queries, avoid JavaScript user-defined functions, order query operations to maximize performance, optimize JOIN patterns
147
SQL Anti-Patterns
Avoid Self-Joins, avoid data skew, avoid unbalanced joins, avoid joins that generate more outputs than inputs (Cartesian product), avoid DML statements that update or insert single rows
148
Optimizing Storage BQ
1. Use expiration settings to control storage costs and optimize use of storage space 2. Take advantage of long-term storage: lower monthly charges apply for data stored in tables or in partitions that have not been modified in the last 90 days 3. Use the Google pricing calculator to estimate storage costs
149
Primitive Roles
Granted at the project level, giving access to the related project datasets; individual dataset access will override the primitive access. Three types of these roles: Owner, Editor, Viewer
150
Predefined Roles
grant more granular access, defined at the service level, GCP managed
151
Custom Roles
User managed
152
Cloud DLP
Handling Sensitive Data: credit card numbers, med info, SSN, people names, address info can be protected by the Cloud Data Loss Prevention (Cloud DLP) Cloud DLP 1. Fully managed service 2. Identify and protect sensitive data at scale 3. Over 100 predefined detectors to identify patterns, formats, and checksums 4. It also de-identifies the data
153
Encryption in BQ
BQ encrypts data through the Data Encryption Key (DEK) For highest levels of security, the DEK key needs to be encrypted to form the Wrapped DEK-this is done using the Key Encryption Key (KEK) Wrapped DEK/DEK stored together KEK is stored in the Cloud Key Management Service
154
Monitoring/Alerts in BQ
Alerts should be created when a monitoring metric crosses a specified threshold. BQ uses Stackdriver for monitoring and sends its logs there. In Stackdriver, you can filter for the BigQuery logs, create dashboards, add charts to the dashboards, and create alerts
155
Cloud Audit Logs
Collections of logs that are provided by GCP to allow insight into various services. Log versions: AuditData (old) maps directly to individual API calls made against the service. BigQueryAuditMetadata is not strongly coupled to particular API calls; it is more aligned to the resource itself and closely associated with the state of the BigQuery resources, which can be changed by API calls, services, and API tasks. Stackdriver has three different streams: Admin, System, and Data. Streams are just groupings for different types of logs
156
BQ ML Access
1. Web console (UI) 2. bq command line tool 3. BQ REST API 4. Jupyter notebooks (Cloud Datalab) and other external BI tools
157
Linear Regression
where you have a number of data points and try to fit a line to those data points
158
Binary Logistic Regression
We have 2 classes and assign each example to one of the classes
159
Multi-class Logistic Regression
We have more than 2 classes and assign each example to one of them
160
K-Means Clustering
We have a number of points and are able to separate them out into different clusters-newest one on BQ ML
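A bare-bones k-means sketch on one-dimensional points, written in plain Python rather than BQ ML, just to show the assign-then-update loop. The starting centroids and the function name `kmeans_1d` are arbitrary choices for this example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


final_centroids, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], [0, 5])
```

No labels are supplied anywhere: the two groups emerge from the distances alone, which is what makes this an unsupervised method.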
161
Benefits of BQ ML
1. Democratizing ML 2. Models trained and evaluated using SQL 3. Speed and agility 4. Simplicity 5. Avoid regulatory restrictions
162
EXAM TIPS for BQ
Understand good organizational design: consider how different teams should be granted different types of access to BQ and how those decisions affect cost control. Learn the most common IAM roles: learn how to grant access to teams based on needs and how to use authorized views to share data across projects. Consider costs when designing queries: avoid using SELECT *, use previews, and price queries before executing them. Partition tables appropriately: partitioned tables can reduce cost, and clustering can reduce scans of unnecessary data. Optimize query operations and JOINs
163
What is Datalab?
What is it? 1. A pre-existing technology, wrapped in some GCP conveniences-Jupyter Notebooks
164
Jupyter Notebooks
1. Interactive web pages that have… 2. Documentation 3. Code 4. Elements which are the results of compiled code
165
Datalab functions
When typing code, there is a Cloud Datalab VM that has a Python kernel. The kernel can run code and access GCP services like BigQuery or ML Engine. Good way to collaborate and share code. Also a good way to annotate. Has matplotlib, which is great for statistical data and graphs. When you save your work through the notebook, it goes into the Cloud Source Repositories repo, which is kept on the persistent disk attached to the Datalab instance
166
Why Do We Need Datalab?
1. Manages instance lifecycle 2. Create Datalab VMs in seconds 3. Notebooks stored in Cloud Source Repositories 4. Storage can persist after the instance is destroyed
167
Intro to Data Studio
Data sources feed reports and dashboards. The data sources underneath are databases or files. Files: usually CSV files, stored in Cloud Storage. Databases: BigQuery, Cloud SQL, MySQL, Cloud Spanner, PostgreSQL. Google products: Google Analytics, Sheets, YouTube, Ads, Google Marketing Platform. Third party: Trello, QuickBooks, Facebook Ads. You can share your dashboards and reports by letting users view or edit them, like in Google Drive
168
Chart and Filters in Data Studio
Tables: detailed, heat map inclusion, bar chart inclusion, pagination. Scorecards: KPIs, high level. Pie chart: proportions, %s or absolute values, doughnut or whole, small amounts of data. Time series: time order, trends, curve filtering, forecasting. Bar charts: categorical, vertical or horizontal, single or stacked, reports and dashboards. Geomaps: geographical data, dashboards. Area charts: composition, cumulative totals, reports and dashboards, can combine with time series. Scatter plot: cartesian plane, typically 2 variables, or more through the color or size of the points, dashboards and reports. Filter: allows you to select specific values from a category. Date range: can specify start and end days or predefined intervals like last week, last month, current year or quarter
169
Cloud Composer Overview
Built on Apache Airflow; Google is contributing back to the Airflow project. A task orchestration system designed to automate complex interdependent tasks into pipelines or workflows. Each stage of the pipeline is written in code, and each workflow is written in Python. Provides central management and scheduling. Provides an extensive CLI tool and a comprehensive web UI
170
DAG-Directed Acyclic Graph
A graph consisting of nodes connected by edges. Edges: how we travel from one node to another. Directed: travel goes in one direction. Acyclic: it never circles back, so you can’t reach the same node more than once by traveling along the edges. This makes it possible to represent dependencies between nodes that must be traversed in a specific order. The nodes represent all the tasks in a workflow, organized in a way that shows their relationships and dependencies. DAGs are written in Python, and the order in which tasks are arranged to execute matters. Inside the tasks, there are operators that specify what is to be done. DAGs can contain parameters for when they should run, what the dependencies are, and who should be notified once the dependencies are completed. Cloud Composer manages resources to make sure the workflow completes successfully
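A small standard-library sketch of the idea (it does not use Airflow): tasks are nodes, each task lists the tasks it depends on, and a topological sort yields a valid execution order. The task names are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
workflow = {
    "run_spark_analytics": {"create_cluster"},
    "write_results": {"run_spark_analytics"},
    "email_admin": {"write_results"},
    "delete_cluster": {"write_results"},
}


def execution_order(dag):
    """Return one ordering that respects every dependency edge."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """A graph with a cycle has no valid ordering, so it is not a DAG."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(workflow)
```

A scheduler can only run such a workflow because the graph is acyclic; if two tasks depended on each other, no task could ever be first.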
171
Composer Architecture
A microservices architecture Uses multiple GCP resources grouped together into a Cloud Composer environment Can have more than one environment in a GCP project, but each environment is an isolated installation of Airflow and all of its component parts Some parts get put on a Tenant Project-you can’t see or control, places Airflow database and Airflow web server-provides its web UI on App Engine Flex, it will configure the Identity-Aware Proxy to control access to the web server
171
Why use Cloud Composer rather than Dataflow?
Dataflow-Process Batch or Streamed Data-Apache Beam Cloud Composer-Orchestrate tasks with Python and can use any Python code at any stage of the pipeline-more as a scheduler Can orchestrate Cloud Composer with Dataflow Cloud Composer workflow example: Spark Analytics- A workflow runs daily, sets up a Dataproc cluster, performs Spark analytics, writes results to GCS and emails an administrator, and then deletes the Dataproc cluster Cloud Composer can be used as any scheduled automation task outside of big data
172
Composer in a GCP Environment
In the GCP Project environment- Make a Kubernetes cluster that deploys Redis, the Airflow Scheduler, and the Airflow Workers along with the Cloud SQL Proxy, it will also create 2 Pub/Sub topics for messaging between micro services and a Cloud Storage Bucket for logs, plugins and the DAGs themselves Cloud Composer configures- Airflow Parameters and Environment Labels, you can also customize some parameters DAGs can have a file or multiple files with imports and dependencies- the Python script has the Variables, Operators, and Stages of the Workflow tied together with a DAG object and definition
173
Tasks
A Task is an instance of an Airflow Operator. The scheduler will find any DAG object that you have defined in the Python scripts uploaded to the GCS bucket; if all dependencies are met, the workflow will be scheduled. When you delete an environment, it won’t clean up all the resources it created: the Tenant Project and the GKE cluster that was spun up will be removed, but you will have to manually delete the Pub/Sub topics and the GCS bucket
174
Advanced Composer Features
Custom Airflow parameters get written to the airflow.cfg file, which is used to configure services when Airflow is run for the first time. You can’t change all of the service settings; only some can be modified. You can create environment variables, which Cloud Composer passes to elements of Airflow like the scheduler, web server, and worker processes. Environment variables live in a section and are defined as key-value pairs. Airflow Connection: a collection of authentication information, which can include hostnames, logins, and keys/secret information. Cloud Composer will create connections for BigQuery, Datastore, Cloud Storage, and a generic GCP connection; these have a service account key that authenticates against the GCP API in question. You can make custom connections using the Airflow web UI; connections to Airflow’s own database use the Cloud SQL Proxy. In the web UI, you can use the ad hoc query page to run SQL queries against any connected database, with some built-in visualizations as well
175
Extending Airflow
1. Local Python environment: add additional custom libraries 2. Airflow plugins: write your own custom operator 3. PythonVirtualenvOperator: creates a custom environment for a task in a workflow without installing dependencies across all of your workers; each task gets its own environment with its own libraries 4. KubernetesPodOperator: for when you need complete control over an environment; runs a task inside a pod on the Cloud Composer GKE cluster
176
Google Pre-Trained ML Models
1. Cloud Vision API-can detect and label objects within images, facial recognition, can read handwritten info 2. Cloud Video Intelligence API-can identify huge numbers of objects, places, and actions which are taking place in videos 3. Cloud Translation API-can translate between more than 100 languages 4. Cloud Text to Speech API: Convert text to audio/human speaking 5. Cloud Speech to Text-convert from audio to text 6. Cloud Natural Language API-can perform sentiment analysis, entity analysis, entity sentiment analysis, content analysis, content classification
177
Reusable Models
Model training required, min knowledge of ML required, relatively small datasets for training, transfer learning, neural architecture search-like if you use GCP, you can search out what the best model is to solve problem
178
Re-use models-Through Cloud AutoML:
For when you need something very specific. AutoML uses transfer learning and allows you to train your own custom models to solve your own specific problems: AutoML Vision, Video Intelligence, Natural Language, Translation, and Tables
Google AI Platform allows you to train your own models, manage and share models-gives easy access to TensorFlow, TensorFlow Extended (TFX)-end to end platform for deploying machine learning pipelines, gives access to TPUs-speed up process, Kubeflow
179
ML Pipeline
ML has a Model that contains Rules from the Data that’s used to train it ML Pipeline 1. Data preparation-raw data becomes processed data 2. Model Training-processed data is split into Training Data and then Testing Data. Training Data gets inputted into the Model, then it is evaluated against the Test Data to determine how well it inferred rules from the training data 3. Operating- Trained Model is used for Real-time Predictions like predicting what people are going to buy, or for Batch Predictions which are normally offline predictions when many predictions are made in a short period of time
180
Label in ML
Label: something that is of interest to us, like a house price; labels are denoted with y. Labelled example: has a label associated with a set of features (house size, rooms, location…). House price = $K, (x1,x2,x3,x4,…) -> y. Unlabelled examples: have the set of feature values (size, number of rooms, location…) but don’t have a label value. House price = ?, (x1,x2,x3,x4,…); we want to predict the label value, called y prime
181
Feature in ML
Feature: attribute associated with the label and has a relationship with the label, like the size of a house or the number of rooms in a house, denoted with an x-x1,x2… Examples: Features together with labels
182
Measuring Loss and Loss Squared
Measuring Loss: calculate the difference between the predicted value (x7, y7’) and the plotted actual value (x7, y7) with the equation y7 - y7’; basically the difference between the actual value and the predicted value. Measuring Loss Squared: you have one line going through the actual points, then another line that doesn’t match the actual points well. You take the differences between the actual and the predicted points and calculate Loss = (y1 - y1’)^2 + (y2 - y2’)^2 >> 0. You have to square because otherwise the positive and negative differences would cancel out
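A tiny worked example of why the differences are squared, using made-up numbers: the signed errors of a poor fit can sum to zero, while the squared errors cannot hide it.

```python
def signed_error_sum(actual, predicted):
    """Sum of raw differences y - y'; positives and negatives can cancel."""
    return sum(y - y_pred for y, y_pred in zip(actual, predicted))


def squared_loss(actual, predicted):
    """Sum of squared differences (y - y')^2; every miss adds to the loss."""
    return sum((y - y_pred) ** 2 for y, y_pred in zip(actual, predicted))


actual = [1, 2, 3]
predicted = [2, 2, 2]  # a flat line that misses two of the three points

raw = signed_error_sum(actual, predicted)
loss = squared_loss(actual, predicted)
```

Here the raw differences are -1, 0, and 1, so they sum to 0 even though the fit is poor, while the squared loss is 2.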
183
Mean Squared Error
Similar to the loss equation but calculating the mean. Optimization using gradient descent: for y = wx, the loss is a function MSE(w). The gradient gives the direction where the loss function increases, so move in the direction where the loss function is decreasing. Keep doing this until we find w(minimum), which is where the loss function has its lowest value; then you can use that w in the line equation to get the line with the lowest loss. The learning rate is how much we move toward the minimum at each step
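A minimal gradient descent sketch for the one-parameter model y = wx, assuming a fixed learning rate and a fixed number of steps. The data and the hyperparameter values are invented for the example.

```python
def mse(w, xs, ys):
    """Mean squared error of the line y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """d(MSE)/dw: points in the direction where the loss increases."""
    return sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, learning_rate=0.05, steps=200):
    for _ in range(steps):
        # Step against the gradient, i.e. toward lower loss.
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated from y = 2x, so the best w is 2
w_min = gradient_descent(xs, ys)
```

A learning rate that is too large would overshoot the minimum and diverge; one that is too small would need many more steps to get there.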
183
Supervised Learning Type
We train the model using data that is labeled, each example has a label and a feature We have training data that have a set of features that are called x, we include the correct labels which are the ys, features with the correct labels are used to train the model, the unseen data which has only the features is presented to the model, the model uses the training to predict what associated labels should be for each example, this will give us y prime
184
Unsupervised Learning Type
uncover structure within the data set itself Have a set of input data with no labels, have features only and send it to the model, the model then creates outputs which uncover a structure within the input data, maybe use this method to uncover personas within customer data
185
Reinforcement Learning
Common in game playing and other types of ML. You have an agent, which is basically our model. The agent interacts with an environment (the chess board in a game of chess). The state of the environment at time t, state(t), is fed into the model. The agent proposes an action at time t that affects the environment (move a chess piece from one square to another), the state changes to state(t+1), and the agent receives a reward, reward(t), according to how well the action it took at time t moved the environment toward the outcome it wants
186
Regression
Predict a real number (y’). Features are input into a regression model, and the model associates the set of features with a predicted value y’
187
Classification
predict class from specified set{A,B,C,D} with probability There is a classification model, it will take a set of features, then it will associate it with a particular class and an associated probability, ex: (X1,X2,…,Xn,A,0.9817)
188
Clustering
group elements into clusters or groups There is a clustering model, you input an element set, then you associate the element set with a cluster in the output
189
Transfer Learning
Train a model on images to classify them into specific categories, then copy the model and turn it into a new classification model with new categories, and train it with new images. The new model can usually be trained with far fewer images
190
Underfitting
Like a Line, doesn’t capture underlying structure
190
Balanced
Curve that fits the data very well while still being simple, like a parabola
191
Overfitting
Fits the training data very well, better than the other two cases, but a new data point can be a very large distance from the curve. The problem is that it doesn’t generalize to new data
192
L1 Regularization
L1 Regularization term: |W1|+|W2|+…+|Wn|-take the sum of the absolute values of all the weights. Penalizes |weight|, drives the weights of non-contributing features towards 0-sparse models
193
L2 Regularization
W1^2+W2^2+…+Wn^2-take sum of squares of all the weights Penalizes weight squared, drives all weights towards 0-simpler model
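A minimal sketch of the two penalty terms, the sum of absolute weights for L1 and the sum of squared weights for L2. The weight values and the regularization rate are illustrative assumptions; in training, the chosen penalty is added to the loss.

```python
def l1_penalty(weights, rate):
    """L1 term: rate * (|W1| + |W2| + ... + |Wn|)."""
    return rate * sum(abs(w) for w in weights)


def l2_penalty(weights, rate):
    """L2 term: rate * (W1^2 + W2^2 + ... + Wn^2)."""
    return rate * sum(w ** 2 for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical model weights
rate = 0.1                       # hypothetical regularization rate

print(l1_penalty(weights, rate))
print(l2_penalty(weights, rate))
```

Because L2 squares each weight, large weights dominate its penalty, while L1 charges every non-zero weight in proportion to its size, which is what pushes unhelpful weights all the way to 0.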
194
Avoid Overfitting
1. Regularization 2. Increase Training Data 3. Feature Selection 4. Early Stopping-don’t allow the model to train for too long, only a certain number of iterations 5. Cross Validation-take the training data and split it into much smaller training sets, the smaller sets are known as folds 6. Dropout Layers-randomly set a fraction of neuron outputs to 0 during training
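Cross validation from item 5 can be sketched as follows: the examples are split into k folds, and each fold takes one turn as the validation set while the rest are used for training. The helper name `k_fold_splits` is hypothetical and the code works on example indices only.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) once per fold."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra example each.
        end = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, validation
        start = end


splits = list(k_fold_splits(10, 3))
for train, validation in splits:
    print(len(train), validation)
```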
195
Hyperparameters
Hyperparameters: values that need to be selected before the training process can begin Hyperparameter examples: Batch size, training epochs, number of hidden layers in network-model, regularization type-l1 or l2, regularization rate, learning rate Hyperparameter Characteristics 1. Selection: Hyperparameter values need to be specified before training begins 2. Model hyper parameter: relate directly to the model that is selected, Algorithm hyperparameters: relate to the training of the model 3. Training and Tuning: the process of finding the optimal or near optimal values for hyperparameters
196
Feature Engineering
One-hot Encoding (Categorical Data): for data drawn from a fixed set of values, transforms categorical data to numeric values; each category gets its own column and a binary 1 or 0 indicates which category applies. Linear Scaling: transform values across one range into values in another, e.g. -1 to 1 or 0 to 1. Z-Score: values clustered around the mean mu are transformed into a cluster around 0. Log Scaling: a small number of values have many points while the vast majority have few points, transform with X’=log(x). Bucketing: for distributions where data points are related to each other but the relationship is not linear, create defined ranges, data points within a range are mapped to a single value
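Each transformation above fits in a few lines. The sketch below uses only the standard library; the function names, category list, and bucket boundaries are illustrative assumptions.

```python
import math


def one_hot(value, categories):
    """One column per category, 1 for the matching category and 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]


def linear_scale(x, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map x from the range [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (x - lo) * (new_hi - new_lo) / (hi - lo)


def z_score(x, mean, std):
    """Center on the mean and divide by the standard deviation."""
    return (x - mean) / std


def log_scale(x):
    """Compress a long-tailed positive value with X' = log(x)."""
    return math.log(x)


def bucketize(x, boundaries):
    """Return the index of the range that x falls into."""
    return sum(1 for b in boundaries if x >= b)


print(one_hot("green", ["red", "green", "blue"]))  # [0, 1, 0]
print(linear_scale(5, 0, 10))                      # 0.5
print(z_score(12, mean=10, std=2))                 # 1.0
print(bucketize(25, [18, 30, 50]))                 # 1
```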
197
TensorFlow
1. Google’s open source, end to end ML framework 2. Compatible with a wide range of hardware and devices-models can be trained and deployed on CPUs, GPUs, and TPUs 3. TensorFlow Lite-deploying models to mobile and embedded devices 4. TensorFlow.js-JavaScript library for training and deploying models in the browser and on Node.js 5. TensorFlow Extended-deploying ML pipelines
198
Keras
1. Open source neural network library 2. Written in Python 3. Runs on top of other ML frameworks 4. High level API for fast experimentation with deep neural networks-easy to use and extensible 5. Runs on CPUs and GPUs 6. tf.keras-TensorFlow’s implementation of the Keras API specification-build and train models with first class support for TensorFlow specific functionality, more flexible and makes TensorFlow easier to use
199
Google Colab
1. Free cloud service based on Jupyter Notebooks, notebooks are stored in Google Drive 2. Provides free GPU support 3. Supports some Bash commands 4. Includes pre-installed Python libraries
200
Neural Network Layers
Input Layer: allows us to feed data into the model. Output Layer: represents the way we want the neural network to provide answers, with a neuron available for each of the classes. There are Hidden Layers between the Input and Output Layers; the number of hidden layers varies with the type of problem you are trying to solve and is a model hyperparameter. Input=L0 Hidden=L1,L2,L3,L4,L5 Output=L6. Each layer has a specified number of neurons/nodes. An image is fed in as a matrix of pixel values
201
Fully Connected Layer
where every neuron of 1 layer is connected to every neuron in the following layer
202
Partially Connected Layer
where every neuron in one layer is not connected to every neuron in the adjacent layer
202
Neurons
1. Has an input vector X and an output value Y 2. Has a weight vector; the number of weights in the weight vector corresponds to the number of inputs in the input vector 3. X1*W1-each weight is multiplied by the corresponding input value 4. Then the values are summed-(X1*W1)+(X2*W2) 5. A function is applied to this sum-f((X1*W1)+(X2*W2)), which gives us our output y 6. The function is called the Activation function 7. Activation functions: Rectified linear unit (ReLU): f(x)=max(0,x), Sigmoid/Logistic, Hyperbolic tangent. Logits: the raw output values of the final layer. Softmax function: converts logits to probabilities-SUM(p)=1
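The neuron computation in steps 3 to 5 and the softmax function can be sketched as below. The input, weight, and logit values are made up, and the sketch omits a bias term because the card does not mention one.

```python
import math


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def neuron(inputs, weights, activation=relu):
    """Multiply each input by its weight, sum, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest logit avoids overflow without changing the result.
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(neuron([1.0, 2.0], [0.5, 1.0]))   # 2.5
print(neuron([1.0, 2.0], [0.5, -1.0]))  # 0.0, ReLU clips the negative sum
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```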
203
Feed Forward Neural Network
1. Simplest and most common type of deep neural network 2. Applications in computer vision and NLP 3. First type of deep neural network to be used 4. Information flows in one direction only(input to output)
203
Recurrent Neural Networks
1. Flow of information forms cycles/loops 2. Directed cycles 3. RNNs can be difficult to train 4. RNNs are dynamic with their state continuously changing 5. There are cycles
204
Convolutional Neural Network
1. Neural network commonly used for visual learning tasks 2. Common uses: Image and video recognition, image classification, natural language processing 3. Has an input layer, an output layer, and many hidden layers 4. Convolutional layers are paired with pooling layers 5. The convolutional layer applies small filters that pick out certain features of an image
205
Pooling Layer
1. Simplifies (downsamples) inputs 2. Usually follows a non-linear activation function 3. Average Pooling-calculates the average value 4. Max Pooling-calculates the maximum value 5. The pooling window moves along the pixels of the image in strides to calculate results for the output
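A minimal sketch of max and average pooling over a small 2D input. The 4x4 values, the 2x2 window, and the stride of 2 are illustrative assumptions, and `pool2d` is a hypothetical helper rather than a library call.

```python
def pool2d(image, size=2, stride=2, mode="max"):
    """Slide a size x size window over the image and reduce each window."""
    if mode == "max":
        combine = max
    else:
        combine = lambda values: sum(values) / len(values)
    output = []
    for r in range(0, len(image) - size + 1, stride):
        row = []
        for c in range(0, len(image[0]) - size + 1, stride):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(combine(window))
        output.append(row)
    return output


image = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
print(pool2d(image, mode="max"))      # [[6, 4], [7, 9]]
print(pool2d(image, mode="average"))  # [[3.75, 2.25], [4.0, 5.25]]
```

Both modes shrink the 4x4 input to 2x2; max pooling keeps the strongest response in each window, while average pooling smooths it.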
206
GANS
1. GANs are deep neural networks composed of two opposing (adversarial) neural networks: a Generator network and a Discriminator network 2. Allows for the creation of things like: images, music, speech, poetry, deep fakes (face imitation) 3. The generator network outputs generated images that go into the discriminator network, which predicts whether each image is real or fake and sends feedback
207
Vision APIs
Optical Character Recognition (OCR): Can detect text in images. There is text detection (detects text, returns it as JSON and groups the text into blocks, can also read handwriting) and document text detection (also reads text from images but is meant for dense text in documents). Cropping Hints: Suggests how to crop the image and gives the vertices of the crop. Face Detection: Can detect faces and facial features. Image Property Detection: Finds dominant colors, which can be used to group images/objects or for recommendations. Label Detection: Can detect objects, locations, activities, animal species, products, and much more. Landmark Detection: Detects a landmark in an image and gives the landmark name, bounding polygon vertices, and the location as latitude and longitude. There is also Logo Detection. Explicit Content Detection-safe search detects whether the image is adult, spoof, medical, violence, or racy. Web Entity and Page Detection-where the image is being used, links to web pages, similar images, best guess labels
208
Video Intelligence APIs
Detect Labels: annotates videos where entities are detected, list of video segments, frame annotations and shots Shot change detection: annotates videos based on shots or scenes, entities associated with specific scenes Detect explicit content: detect adult content, annotates explicit content, and timestamps where detected Transcribe Speech: transcribes spoken words, profanity filtering, transcription hints, automatic punctuation, handling multiple speakers Track objects: track multiple objects, provides location of each object with frames, bounding boxes for each object, time segments with offset Detect Text: OCR on text occurring within videos, text and location of the text Google Knowledge graph Search API allows you to do searches on Google’s Knowledge Graph-all entities and relationships between them, each object has an identifier 1. Getting ranked list of most notable entities that match criteria 2. Predictively completing entities within a search box 3. Annotating or organizing content using the knowledge graph entities
209
Natural Language API
Looks at patterns within language/text, uses sentiment analysis, entity analysis, syntax analysis, entity-sentiment analysis, and content analysis
210
Sentiment Analysis
1. Score: Indicates overall emotion, ranges between -1 (negative) and 1 (positive) 2. Magnitude: How much emotional content there is, ranges from 0 to infinity, not normalized (proportional to the length of the document being assessed)
211
Entity Analysis
1. Identifies entities within text 2. Provides information on the identified entities 3. Entities are nouns/things-Proper Nouns(Specific like Albert Einstein) and Common Nouns(mug=any mug)
212
Entity-Sentiment Analysis
1. Combines Entity and Sentiment Analysis 2. Tries to determine the sentiment expressed towards each of the identified entities 3. Returns numerical score and magnitude values
213
Syntax Analysis and Content Classification
Syntax Analysis: Takes streams of text through Sentences, Tokenization of text/streams breaks them up into tokens, sentences and tokens determine grammatical information Content Classification: API will return categories that are most specific to the source text
214
Dialogflow
1. Natural language interaction platform 2. Mobile and web apps, devices, bots 3. Analyzes text or audio inputs 4. Responds using text or speech 5. Intents: categorize an end user’s intention so the agent can understand and respond-like a class/object 6. Each intent has training phrases that are mapped onto it, from which parameters can be extracted 7. Each parameter has a type called an entity type 8. The end user gives an input phrase, it goes to an agent, intent classification matches an intent, parameters are extracted from it, these can trigger an action, and from there a response goes back to the end user
214
Cloud Speech to Text: Has audio files and audio stream
1. Synchronous Recognition: Returns result after all input audio has been processed 2. Asynchronous Recognition: Initiates long running operation 3. Streaming Recognition: Audio data is provided within a gRPC bi-directional stream 4. Models: video, phone call, command and search, default
215
Text-to-Speech API
1. Uses text files or Speech Synthesis Markup Language(SSML)-allows you to control the way text is converted to speech 2. SSML: Pauses, play sounds, speak cardinals, speak ordinals, speak characters, phrase substitution
216
AutoML
Suite of ML products that facilitates the training of custom ML models: highly performant, speed of delivery, human labelling service. When you pose a problem and give it the desired result, it finds a suitable neural network using NN Search. Once it gets a neural network that works from the NN Bank, it applies transfer learning so the AutoML model is easily trained to handle novel data
216
AutoML Process
1. Prepare and manage images-label training images, create dataset (single or multi-label) 2. Training models-requires prepared dataset, may take a few hours to complete, training creates a new model 3. Evaluating models-after training, on a test set, aggregated and detailed information (area under curve, confidence threshold curve, confusion matrix) 4. Deploying models-deploy before making online predictions 5. Making predictions-individual or bulk 6. Undeploying models-after successful predictions or once you have a better model, there are cost implications of not undeploying models
217
Vision Edge
1. Export custom trained models 2. Models optimized for edge devices 3. TensorFlow Lite, Core ML, container export formats 4. Edge TPUs, ARM and NVIDIA 5. AutoML Vision Edge in ML Kit
218
AutoML Translation
AutoML Translation-you can train a custom model to translate domain-specific phrases, e.g. from English to French. AutoML Translation Training-you use source/target pairs, source sentences are in the source language, target sentences are in the target language. Translation Considerations 1. Data Coverage-include examples of vocabulary, usage, and grammatical peculiarities that are specific to your domain, the model needs to be exposed to the language in some form 2. Human Involvement-people who understand both languages should be involved 3. Data Quality-this is VERY IMPORTANT for translation training, source and target documents need to align
218
AutoML Natural Language
1. Create custom models to classify content into custom categories you define 2. When pre-defined categories are insufficient 3. When you want to categorize content from free text 4. You want to create own categories for categorization
219
AutoML Table Capabilities
1. Data Support: AutoML Tables provide info on missing data, AutoML tables provide correlations, cardinality, and distributions for each feature 2. Automatic Feature Engineering: Normalizes and bucketizes numerical features, creates one-hot encoding and embeddings for categorical features, performs basic text processing for text fields, extract time and date features from timestamp features 3. Model Training: Parallel testing of multiple model types like Linear and Feed forward deep neural network, selects best model for predictions
220
AutoML vs BQ
AutoML Tables vs BigQuery ML 1. BigQuery: Rapid Iteration, still deciding on features to include 2. AutoML: Optimizing model quality, have time available for model optimization, multifarious input features
221
Kubeflow
1. ML Toolkit for Kubernetes 2. Data modeling with Jupyter Notebooks 3. Tuning and training with TensorFlow 4. Model serving and monitoring. Production Phase: Transform Data with pipelines, Train Model with (MPI, MXNet, PyTorch), Serve Model using (TFServing, KFServing, NVIDIA TensorRT), Monitor the model using (TensorBoard, Metadata). Pipeline: Description of an ML workflow, including all of the components in the workflow and how the components relate to each other in the form of a graph. Pipeline Component: Self-contained set of user code, packaged as a container, that performs one step in the pipeline
221
AI Platform
Ingest Data: Cloud Storage, Cloud Storage Transfer Service Prepare and Preprocess Data: Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Dataprep AI Platform data labeling service can label training data by applying classification, object detection, and entity extraction Develop and Train Models: Deep Learning VM, AI Platform Notebooks, AI Platform Training, KubeFlow Test and Deploy Models: TensorFlow Extended, AI Platform Prediction, Kubeflow Discovery: Google AI Hub Quick and Ready to Go ML
222
IAM Best Practices
The principle of least privilege: Use predefined roles specific to each GCP product or service. Each stack should have its own boundary. Policies applied to a parent object are inherited by a child object. Use Groups in G Suite or Cloud Identity, grant roles to groups and not individual users
222
Service Accounts
Service Accounts: Special type of Google account designed to represent non-human users 1. Virtual Machines-act via SAs that determine the services they can access 2. Programmatic access-should always be achieved using a SA 3. IAM Roles-assigned to SAs in exactly the same way as to human user accounts
223
Human User Accounts
Human User Accounts: Passwords and multi factor authentication. Service Accounts have keys that can be downloaded in JSON format. A key can be used by an application to authenticate against Google APIs; keys should be rotated and carefully protected. Cloud IAM API: Request OAuth, OpenID, JWT. Service Account User Role-can impersonate the SA and gets access to everything its IAM policies allow
224
Data Security
Offers encryption in flight and at rest Can use Cloud Key Management Service (KMS) if you want to make your own keys Keys can be grouped together in key rings and can be used in multiple GCP services Limit Blast Radius VPC Service Controls-define security perimeter, only access services inside the perimeter GCP Security Command Center: Asset Management Features, Web Security Scanner, Anomaly Detection, Threat Detection-internal
225
Data Privacy
Should people have access to all the data? What is the data? Do we need this to complete the task? Are we allowed to store it?-PII Personally Identifiable Information-personal information to identify a specific individual GCP Cloud Data Loss Prevention: Text, Images, Pseudo-Anonymization(dummy data), Risk Analysis DLP API can become expensive
226
Industry Regulation
FedRAMP: Department of Defense, Homeland Security, USA-how data is used and stored securely by cloud vendors, high compliance for most GCP Services. Children’s Online Privacy Protection Act (COPPA): Use of PII data for children under the age of 13, incorporate parental consent, clear privacy policy, justification for data collection. HIPAA: Protects Personal Health Information, requires acceptance of a business associate agreement. PCI DSS: GCP is certified compliant, your apps must be compliant too. GDPR: Europe, protects personal data of EU citizens, applies to any region that interacts with the EU
227
Dataprep Overview
Exploring, cleaning, and preparing data. Visually define transformations. Export to Cloud Dataflow. Integrated partner service from Trifacta-links to GCP project and datasets
228
Flows
Top level container for bringing together and organizing datasets recipes and transformations to one place
229
Datasets
Collections of data that we will use from Dataprep Flow-can import datasets from local machine, Cloud Storage or BigQuery
230
Recipes
Like an instruction manual, a series of steps that perform transformations on the datasets and create new datasets. For recipes, there are visual controls to do these transformations. You can also see a visual preview. Can use the Automator to execute certain recipes at certain times. Flows are then executed as Cloud Dataflow jobs
231
Cloud Storage
Unstructured object storage: regional, dual-region, or multi-region; Standard, Nearline, or Coldline storage classes; storage event triggers; fully managed object storage-can store images; access via API and SDKs; lifecycle management for objects and buckets; very secure and durable
232
Cloud Bigtable Def
Petabyte-scale NoSQL database, High-throughput and scalability-Wide column key/value data, Time-series, transactional, IoT data
233
BQ Def
Petabyte scale data warehouse, Fast SQL queries across large datasets, Foundations for BI and AI, and has Public Datasets
234
Cloud SQL
Managed MySQL + PostgreSQL instances, built in backups, replicas, and failover, vertically scalable, SQL Server
235
Cloud Spanner
Global SQL-based relational database, horizontal scalability and high availability, strong consistency, good for financial sector
236
Cloud Firestore
Fully managed NoSQL document database, large collections of small JSON documents, provides realtime database SDKs
237
Cloud Memorystore
Managed Redis instances, in-memory DB cache or message broker, built in high availability, vertically scalable
238
Data Modeling
structured data-consistent model, model maybe in place, data could require prep or transformation. 3 stages: Conceptual-what are entities/relationships, Logical-what are the structures of the entities, can the model be normalized? Physical-how will I implement this into the database?What keys or indexes do I need?
239
Relational Data vs Non Relational Data
good schema design, normalization and reducing waste, have accuracy and integrity-data types and tables Non-relational-simple key/value store or document store, high volume columnar database-NoSQL Pipeline could be going from datable to big query
240
Bucket
Store objects or files, name unique, exists within projects, regional-low cost, dual-regional and multi-regional-geo redundancy. Storage classes: Standard $0.02 per GB, Nearline $0.01 per GB with 30 day min storage and data retrieval fee, Coldline 90 day min storage $0.004 per
241
GCS Info
GB, Archive storage for at least a year min $0.0012 per GB and data retrieval fee. Objects are stored as opaque data, objects are immutable, overwrites are atomic, can be versioned. Access through Google Cloud Console, HTTP API, SDKs, and the gsutil command in a terminal; parallel uploads, transcoding, integrity checking, requester pays. GCS Costs: operation charges (Class A, more expensive, e.g. uploading; Class B, e.g. downloading), network charges like retrieving data from a bucket, and data retrieval charges. Can apply lifecycle rules to a bucket, IAM access for buckets, ACLs for granular access, or signed policy documents. IAM: has members and roles
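A rough storage-cost comparison using the per-GB figures quoted on these cards. It assumes those figures are monthly rates, and it ignores retrieval fees, operation charges, network charges, and minimum storage durations; actual pricing varies by region and changes over time.

```python
# Per-GB prices as quoted on the flashcards (assumed to be per month).
PRICE_PER_GB = {
    "standard": 0.02,
    "nearline": 0.01,
    "coldline": 0.004,
    "archive": 0.0012,
}


def monthly_storage_cost(storage_class, gigabytes):
    """Storage-only cost; excludes retrieval, operation, and network fees."""
    return PRICE_PER_GB[storage_class] * gigabytes


for storage_class in PRICE_PER_GB:
    print(storage_class, round(monthly_storage_cost(storage_class, 1000), 2))
```

Colder classes cost less to hold data but add retrieval fees and minimum storage durations, so they suit data that is rarely read.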
242
Service Accounts Best Practices
IAM for bulk access to buckets, roles assigned to members, ACLs for granular access to buckets, ACLs grant permissions to a scope, IAM is more recommended
243
Data Transfer Service
Cloud Storage-source to sink; sources: HTTP, Amazon S3, Cloud Storage; filter by names and dates, schedule transfers, delete objects in destination bucket, delete objects in source bucket. Full access: storagetransfer.admin, Submit transfers: storagetransfer.user, List jobs and operations: storagetransfer.viewer
244
BQ Transfer Service
automates data transfer to Bigquery, data loaded on a regular basis, backfills can recover gaps, google marketing sources, sources in beta
245
Google Transfer Device
very very large amounts of data , physical storage device that is attached to the server terabyte versions, security guaranteed
246
Human Accounts Cont
Users are human users who authenticate with their own credentials-not used for non human operations, passwords could leak
247
Service Accounts Cont
Created for a specific non human task for granular authorization, identity can be assumed by an application, keys can also be easily rotated, there are Google-managed and user-managed service accounts. SAs authenticate with keys, user managed keys are a downloadable JSON file-very powerful