GCP Data Engineer Quick Terms Flashcards
3 V’s of Big Data
1.) Volume – Scale of data being handled by systems (can it be handled by a single server?)
2.) Velocity – Speed at which it is being processed
3.) Variety – The diversity of data sources, formats, and quality
What is a Data Warehouse?
1.) Data Warehouse
a. Structured and/or processed
b. Ready to use
c. Rigid structures – hard to change, and may not be the most up to date
What is a data lake?
2.) Data Lake
a. Raw and/or unstructured
b. Ready to analyze – more up to date but requires more advanced tools to query
c. Flexible – no structure is enforced
4 Stages of a Data Pipeline
1.) Ingestion
2.) Storage
3.) Processing
a. ETL – data is taken from a source and manipulated to fit the destination before it is loaded
b. ELT – data is loaded into a data lake and transformations can take place later
c. Common transformations: Formatting / Labeling / Filtering / Validating
4.) Visualization
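The difference between ETL and ELT is mostly about when the transformation runs. The sketch below illustrates that ordering with plain Python lists standing in for a warehouse and a lake. The records, the field names (`id`, `temp_c`), and the functions `transform`, `etl`, and `elt` are all hypothetical and are not part of any GCP API.

```python
# Hypothetical raw records; field names are illustrative only.
raw_rows = [
    {"id": "1", "temp_c": "21.5"},
    {"id": "2", "temp_c": "bad"},
    {"id": "3", "temp_c": "19.0"},
]


def transform(row):
    """Format and validate one row; return None if validation fails."""
    try:
        return {"id": int(row["id"]), "temp_c": float(row["temp_c"])}
    except ValueError:
        return None


def etl(rows):
    """ETL: transform first, so the destination only ever holds clean rows."""
    warehouse = []
    for row in rows:
        clean = transform(row)
        if clean is not None:
            warehouse.append(clean)
    return warehouse


def elt(rows):
    """ELT: load raw rows as-is, and transform later when they are queried."""
    lake = list(rows)  # loaded untouched

    def query():
        return [c for c in map(transform, lake) if c is not None]

    return lake, query
```

In this sketch both paths end with the same clean rows, but the ELT lake still holds the invalid record, so it can be reprocessed later with a different transformation.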
Cloud Storage
o Unstructured object storage
o Regional, dual-region, or multi-region
o Standard, Nearline, Coldline, or Archive
o Storage event triggers (Pub/Sub) – usually the first step in a cloud data pipeline
Cloud Bigtable
o Petabyte-scale NoSQL database
o High throughput and scalability
o Wide-column key/value data
o Time-series, transactional, IoT data
Cloud BigQuery
o Petabyte-scale analytics data warehouse (DW)
o Fast SQL queries across large datasets
o Foundation for BI and AI
o Useful public datasets
Cloud Spanner
o Global SQL-based relational database
o Horizontal scalability and HA
o Strong consistency
o Not cheap to run; often used for financial transactions
Cloud SQL
o Managed MySQL, PostgreSQL, and SQL Server instances
o Built-in backups, replicas, and failover
o Does not scale horizontally, but does scale vertically
Cloud Firestore
o Fully managed NoSQL document database
o Large collections of small JSON documents
o Realtime database with mobile SDKs
o Strong consistency
Cloud Memorystore
o Managed Redis instances
o In-memory DB, cache, or message broker
o Built-in HA
o Vertically scalable by increasing the amount of RAM
Cloud Storage (GCS) at a high level
o Fully managed object storage
o For unstructured data: images, videos, etc.
o Access via API or programmatic SDK
o Multiple storage classes
o Instant access in all classes; also has lifecycle management
o Secure and durable (HA and maximum durability)
GCS Concepts, what is GCS, where can buckets be?
o A bucket is a logical container for objects
o Buckets exist within projects (named within a global namespace)
o Buckets can be:
o Regional $
o Dual-regional $$ (HA)
o Multi-regional $$$ (data is spread across multiple regions within a large geographic area, such as US or EU; highest availability)
4 GCS Storage Classes
o Standard
o Nearline
o Coldline
o Archive
Describe standard GCS Storage Class
o $0.02 per GB
o 99.99% regional availability
o >99.99% availability in multi- and dual-regions
Describe Nearline GCS Storage Class
o 30-day minimum storage
o $0.01 per GB up/down
o 99.9% regional availability
o 99.95% availability in multi- and dual-regions
Describe Coldline GCS Storage Class
o 90-day minimum storage
o $0.004 per GB stored
o $0.02 per GB up/down
o 99.9% regional availability
o 99.95% availability in multi- and dual-regions
Describe Archive GCS Storage Class
o 365-day minimum storage
o $0.0012 per GB stored
o $0.05 per GB up/down
o 99.9% regional availability
o 99.95% availability in multi- and dual-regions
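The storage class cards trade a lower storage price for a higher access price, and a small calculation makes the break-even point easier to see. The sketch below uses the per-GB figures from the cards for Standard, Coldline, and Archive. It rests on several assumptions: the storage figure is treated as a monthly rate, the "up/down" figure as a per-GB retrieval charge, and Standard as having no retrieval charge. Nearline is left out because its card does not list a separate storage price. Real prices vary by location and change over time, and `monthly_cost` and `cheapest` are hypothetical helper names.

```python
# Per-GB figures copied from the cards above. Treating "storage" as a monthly
# rate and "retrieval" as the up/down charge is an assumption of this sketch.
CLASSES = {
    "standard": {"storage": 0.02, "retrieval": 0.0},
    "coldline": {"storage": 0.004, "retrieval": 0.02},
    "archive": {"storage": 0.0012, "retrieval": 0.05},
}


def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Estimated cost for one month of storing and reading data."""
    price = CLASSES[storage_class]
    return stored_gb * price["storage"] + retrieved_gb * price["retrieval"]


def cheapest(stored_gb, retrieved_gb):
    """Return the class with the lowest estimated monthly cost."""
    return min(CLASSES, key=lambda c: monthly_cost(c, stored_gb, retrieved_gb))


if __name__ == "__main__":
    for read_gb in (0, 100, 1000):
        print(read_gb, cheapest(1000, read_gb))
```

The sketch ignores minimum storage durations, so it understates the cost of deleting or replacing objects in the colder classes before their minimum has passed.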
Objects in cloud storage (encryption, changes)
o Encrypted in flight and at rest
o Objects are immutable; to change one you must overwrite it (an atomic operation)
o Objects can be versioned
Name the 5 “advanced” features of GCS
o Parallel uploads of a single object
o Integrity checking – pre-calculate an MD5 hash, which is compared to the one Google calculates
o Transcoding for compression
o Requester can pay, if desired
o Pub/Sub notifications – new files generate notifications that can trigger a data pipeline
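The integrity-checking card comes down to computing a hash locally and comparing it with the one the service reports. The sketch below shows only the local half, using the Python standard library. It assumes the reported hash is the base64 encoding of the raw MD5 digest, which is how Cloud Storage object metadata is generally documented, so check the current docs before relying on it. `md5_base64` and `verify` are hypothetical helper names.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Base64-encode the raw 16-byte MD5 digest of the data."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def verify(data: bytes, reported_md5: str) -> bool:
    """Compare a locally computed hash with the one reported after upload."""
    return md5_base64(data) == reported_md5


if __name__ == "__main__":
    payload = b"hello"
    local_hash = md5_base64(payload)
    print(verify(payload, local_hash))   # matching content
    print(verify(b"hellp", local_hash))  # corrupted content
```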
What is Cloud Storage Transfer Service?
o Transfers from a source to a sink (bucket); supported sources: S3, HTTP, Cloud Storage
o Transfers can be filtered based on names/dates
o Schedule it for one run or periodically (can delete in source or destination after transfer is confirmed)
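Filtering a transfer by names and dates can be pictured as a predicate applied to each source object. The sketch below models that idea in plain Python and does not call the transfer service. The function `select_for_transfer`, its parameters, and the sample object names are all hypothetical.

```python
from datetime import datetime, timezone


def select_for_transfer(objects, include_prefix="", modified_after=None):
    """Return names of objects that pass a name-prefix and date filter.

    `objects` is an iterable of (name, last_modified) pairs.
    """
    selected = []
    for name, modified in objects:
        if not name.startswith(include_prefix):
            continue  # filtered out by name
        if modified_after is not None and modified <= modified_after:
            continue  # filtered out by date
        selected.append(name)
    return selected


# Hypothetical source listing.
source = [
    ("logs/2024-01-01.json", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("logs/2024-02-01.json", datetime(2024, 2, 1, tzinfo=timezone.utc)),
    ("images/cat.png", datetime(2024, 2, 5, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 1, 15, tzinfo=timezone.utc)
recent_logs = select_for_transfer(source, "logs/", cutoff)
```

With this sample listing, only the February log file passes both filters.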
What is BigQuery Data Transfer Service?
o Automates data transfer to BigQuery
o Data is loaded on a regular basis
o Backfill can recover from gaps or outages
o Supported sources: Cloud Storage, Merchant Center, Google Play, S3, Teradata, Redshift
What is a transfer appliance?
physical rack storage device, 100 TB and 480 TB versions
What are the top 3 features of Cloud SQL?
o Managed SQL instances (creation, replication, backups, patches, updates)
o Multiple DB engines (MySQL, PostgreSQL, SQL Server)
o Scalability – vertically to 64 cores and 416 GB of RAM; HA options are available