GCP Professional Data Engineer Cert Flashcards
(264 cards)
Relational Databases
Have relationships between tables
Google Cloud SQL: managed SQL instances with minimal setup; multiple database engines (e.g., MySQL); scalability and availability (vertically scales up to 64 cores); secure connections via the Cloud SQL Proxy, SSL/TLS, or private IPs; maintenance windows, automated backups, and point-in-time recovery
Importing MySQL data: InnoDB mysqldump export/import, CSV import, or external replica promotion (requires binary log retention)
PostgreSQL instances are another option: automated maintenance and high availability, though some PostgreSQL features are unsupported
Importing PostgreSQL data: SQL dump export/import, CSV import
Cloud Firestore
- Fully managed NoSQL document database: serverless autoscaling
- Realtime database with mobile SDKs: Android and iOS client libraries, frameworks for popular programming languages
- Strong scalability and consistency: horizontal autoscaling
A bundle of multiple documents = a collection
Documents can hold subcollections, e.g. messages under a chat-room document (see the sketch below)
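A minimal sketch of the collection/document/subcollection hierarchy using the google-cloud-firestore Python client; the collection, document, and field names are illustrative assumptions, not part of any fixed schema.

```python
from google.cloud import firestore

# Client picks up credentials and project from the environment.
db = firestore.Client()

# A collection ("rooms") bundles multiple documents (hypothetical names).
room_ref = db.collection("rooms").document("room-a")
room_ref.set({"name": "General", "topic": "announcements"})

# Messages live in a subcollection under the room document.
msg_ref = room_ref.collection("messages").document()  # auto-generated ID
msg_ref.set({"from": "alice", "text": "Hello!", "sent_at": firestore.SERVER_TIMESTAMP})

# Read the subcollection back.
for msg in room_ref.collection("messages").stream():
    print(msg.id, msg.to_dict())
```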
Cloud Spanner
- Managed, SQL-compliant database: SQL schemas and queries with ACID transactions
- Horizontally scalable: strong consistency across rows and regions, from 1 to 1,000s of nodes
- Highly available: automatic global replication, no planned downtime, 99.999% SLA
High Cost
CAP Theorem
Consistency: every read sees the same data once a change is committed, according to specific rules. Availability: the database is always available to serve queries. Partition tolerance: the system must tolerate failures and keep working despite the loss of part of a partition
A distributed system can typically guarantee only two of the three at once
Spanner is strongly consistent and highly available; when it must choose, it favors consistency over availability. It runs on Google's global private network and offers five 9s of availability
Cloud Spanner Architecture
An instance is an allocation of resources: an instance configuration (regional or multi-regional) and an initial number of nodes
Regional configuration: a region contains one or more zones, and the instance keeps a replica in each zone. With a node count of 1, each replica is served by one virtual machine; raising the node count adds more machines for more compute power. The number of replicas stays the same while the machines/nodes change, so each node effectively spans the replicas across the different zones (see the sketch below)
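A hedged sketch of creating a regional instance with the google-cloud-spanner Python client; the instance ID, display name, and node count are illustrative assumptions.

```python
from google.cloud import spanner

client = spanner.Client()

# Regional configuration: replicas kept in the zones of us-central1.
config_name = f"projects/{client.project}/instanceConfigs/regional-us-central1"

# node_count controls compute power (VMs serving each replica); the
# number of replicas itself is fixed by the configuration.
instance = client.instance(
    "example-instance",            # hypothetical instance ID
    configuration_name=config_name,
    display_name="Example",
    node_count=3,
)

operation = instance.create()      # long-running operation
operation.result(timeout=300)      # block until the instance is ready
```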
Cloud Memorystore
In-memory database
1. Fully managed Redis instances: provisioning, replication, and failover are fully automated
2. Basic tier: an efficient cache for applications that can withstand a cold restart and a full data flush
3. Standard tier: adds cross-zone replication and automatic failover
Benefits: no need to provision your own VMs, scale instances with minimal impact, private IPs and IAM, automatic replication and failover
Creating an instance: choose Redis version 3.2 or 4, service tier and region, memory capacity 1-300 GB (determines network throughput), and additional configuration parameters
Connecting to instances: Compute Engine, Kubernetes Engine, App Engine, Cloud Functions (serverless VPC connector)
Import and export: export to RDB backup (beta); admin operations not permitted during export, may increase latency, RDB file written to Cloud Storage
Import from RDB backup: overwrites all current instance data; instance unavailable during the import process
Use cases: Redis can serve as a session cache (common uses are logins and shopping carts), as a message queue that enables loosely coupled services, or as an advanced pub/sub messaging layer (see the sketch below)
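A minimal sketch of the three use cases using the standard redis-py client against a Memorystore instance; the private IP, key names, and TTL are illustrative assumptions.

```python
import redis

# Connect from a VM in the same VPC to the instance's private IP (hypothetical).
r = redis.Redis(host="10.0.0.3", port=6379)

# Session cache: store a login / shopping-cart session with a 30-minute TTL.
r.setex("session:user123", 1800, '{"cart": ["sku-1", "sku-2"]}')
print(r.get("session:user123"))

# Message queue: loosely coupled services push and pop work items.
r.lpush("orders:queue", "order-42")
print(r.rpop("orders:queue"))

# Pub/Sub: publish a message to subscribers of a channel.
r.publish("notifications", "order-42 shipped")
```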
Comparing Storage Options
- Is the data structured or unstructured? Structured: SQL data, NoSQL data, analytics data, keys and values. Unstructured: binary blobs, videos, images, proprietary files. For unstructured data, use Cloud Storage
- Is the data going to be used for analytics? Low latency vs. warehouse. Low latency: petabyte scale, single-key rows, time-series or IoT data; choose Cloud Bigtable. Warehouse: petabyte scale, analytics warehouse, SQL queries; choose BigQuery
- Is the data relational? Horizontal vs. vertical scaling. Horizontal scaling: ANSI SQL, global replication, high availability and consistency; it is expensive, so consider whether the client can afford it (most financial institutions would); choose Cloud Spanner. Vertical scaling: MySQL or PostgreSQL, managed service, high availability; choose Cloud SQL
- Is the data non-relational? NoSQL vs. key/value. NoSQL: fully managed document database, strong consistency, mobile SDKs and offline data; choose Cloud Firestore. Key/value: managed Redis instances with full Redis capabilities; choose Cloud Memorystore. The whole decision flow is sketched below
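The decision flow above condensed into a small helper; the flag names and return strings are just illustrative labels, not an official API.

```python
def pick_storage(structured, analytics=False, low_latency=False,
                 relational=False, horizontal_scaling=False, key_value=False):
    """Map the flashcard decision tree onto GCP storage products."""
    if not structured:
        return "Cloud Storage"                              # blobs, videos, images
    if analytics:
        return "Cloud Bigtable" if low_latency else "BigQuery"
    if relational:
        return "Cloud Spanner" if horizontal_scaling else "Cloud SQL"
    return "Cloud Memorystore" if key_value else "Cloud Firestore"

print(pick_storage(structured=False))                       # Cloud Storage
print(pick_storage(structured=True, analytics=True))        # BigQuery
print(pick_storage(structured=True, relational=True,
                   horizontal_scaling=True))                # Cloud Spanner
```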
Streaming
Continuous collection of data, near-real-time analytics, windows and micro-batches
Batch
Data gathered within a defined time window, large volumes of data, data from legacy systems
NoSQL
Anything that isn't SQL: key/value stores, JSON document stores, tools like MongoDB and Cassandra
SQL
Row-based tabular data; relational, i.e. connects to other tables/queries
Online Analytical Processing (OLAP)
Low volume of long-running queries
Aggregated historical data, e.g. purchasing analytics
Online Transactional Processing (OLTP)
High volume of short transactions, high integrity, SQL
Modifies the database
Defines Big Data
- Volume: Scale of information being handled by data processing systems
- Velocity: Speed at which data is being processed, ingested, analyzed, and visualized
- Variety: The diversity of data sources, formats, and quality
MapReduce
A programming model with Map and Reduce functions
Distributed implementation
Created at Google to solve large-scale data processing problems
Map Function
Takes an input pair from the user and produces a set of intermediate key/value pairs
Reduce Function
Merges intermediate values associated with the same intermediate key, forming a smaller set of values
This model standardized the framework: the implementation abstracts away the distributed computing details, parallelizing and executing jobs with partitioning, scheduling, and fault tolerance
Splits jobs into small chunks
Master and worker cluster model
Failed worker jobs reassigned
Worker files buffered to local disk
Partitioned output files
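A toy word-count sketch of the map and reduce functions in plain Python; a real framework would handle the shuffle, partitioning, scheduling, and fault tolerance across worker machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in the input."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all values for the same intermediate key."""
    return word, sum(counts)

documents = {"d1": "the cat sat", "d2": "the cat ran"}

# Shuffle phase: group intermediate values by key (done by the framework).
groups = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_fn(doc_id, text):
        groups[word].append(count)

print(dict(reduce_fn(w, c) for w, c in groups.items()))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```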
Hadoop and HDFS
Named after a toy elephant; inspired by the Google File System; originated as a sub-project of Apache Nutch, begun in 2006
Modules:
- Hadoop Common: base module containing startup scripts
- Hadoop Distributed File System (HDFS): a distributed, fault-tolerant file system that runs on commodity hardware as part of a Hadoop cluster
- Hadoop YARN: handles resource management tasks like job scheduling and monitoring for Hadoop jobs
- Hadoop MapReduce: Hadoop's own implementation of the MapReduce model, which includes libraries for map and reduce functions, partitioning, reduction, and custom job configuration parameters
HDFS Architecture (background that helps with Cloud Dataproc)
One server runs the NameNode, which holds the metadata
Other servers run DataNodes, which store very large files across the cluster as a series of blocks
Racks group servers within the cluster so the shortest possible network path can be used
The client can make multiple requests to the NameNode and then fetch data from multiple DataNodes across racks
Servers/clusters can be replicated for fault tolerance
The YARN architecture is similar: one server runs the ResourceManager and each worker server runs a NodeManager. The client sends jobs to the ResourceManager; on the individual workers, the NodeManager process runs to handle local resources, request tasks from the master, and return the results
Apache Pig: a high-level framework for running MapReduce jobs on Hadoop clusters
Platform for analyzing large datasets
Pig Latin defines analytics jobs: merging, filtering, and transformation; high-level, with SQL-like simplicity
Good for ETL jobs since it has a procedural data flow
It is an abstraction over MapReduce
Apache Pig compiles these instructions into MapReduce jobs, which are then sent to Hadoop for parallel processing across the cluster
Apache Spark
MapReduce's linear flow of data was an issue: reading input, mapping across the data, reducing the results, and writing back to disk on every job
Apache Spark is a general-purpose cluster-computing framework that allows concurrent computational jobs to be run across massive datasets
It is built on resilient distributed datasets (RDDs), read-only multisets of data, with the working set acting as a form of distributed shared memory
Spark Modules
Spark SQL: structured data in Spark stored in a DataFrame abstraction, with programmatic querying via the DataFrames API
Spark Streaming: streaming data ingestion in addition to batch processing, handled as very small batches
MLlib: machine learning library with algorithms for classification, regression, decision trees, and more
GraphX: iterative graph computation
Supports languages: Python, Java, Scala, R, SQL
MUST have two things: a cluster manager (YARN or Kubernetes) and a distributed storage system (HDFS, Apache HBase, or Cassandra)
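A brief PySpark sketch touching the Spark SQL/DataFrames module; the file path and column names are illustrative assumptions, and the cluster manager and storage system come from however the session is configured (YARN, Kubernetes, or local).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The session connects to whichever cluster manager is configured.
spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Spark SQL / DataFrames API over structured data (hypothetical path and schema).
df = spark.read.csv("gs://example-bucket/events.csv", header=True, inferSchema=True)
counts = df.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()
```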
Hadoop vs Spark
Hadoop: slow disk storage, high latency, slow but reliable batch processing
Spark: Fast memory storage, low latency, stream processing, 100x faster in-memory, 10x faster on disk, more expensive
Apache Kafka
Publish/subscribe to streams of records
Like a message bus but for data
High throughput and low latency: ingesting millions of events from devices
Ex: Handling >800 Billion messages a day at LinkedIn
Four main APIs in Kafka: Producer (allows an app to stream records to a Kafka topic), Consumer (allows an app to subscribe to one or more topics and process the stream of records contained within), Streams (allows an application to be a stream processor itself, transforming data and then sending it back to Kafka), and Connect (allows building reusable connectors that link Kafka topics to existing applications and data systems)
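A hedged Producer/Consumer sketch using the third-party kafka-python package (an assumption, not part of Kafka itself); the broker address and topic name are illustrative.

```python
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # hypothetical broker address
TOPIC = "page-views"        # hypothetical topic

# Producer API: stream records to a Kafka topic.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send(TOPIC, key=b"user-1", value=b'{"page": "/home"}')
producer.flush()

# Consumer API: subscribe to one or more topics and process the records.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKER,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for record in consumer:
    print(record.key, record.value)
```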