ACG Notes Flashcards
(163 cards)
3 V’s of Big Data
- ) Volume – Scale of data being handled by systems (can it be handled by a single server?)
- ) Velocity – Speed at which data is being processed
- ) Variety – The diversity of data sources, formats, and quality
What is a Data Warehouse?
- ) Data Warehouse
a. Structured and/or processed
b. Ready to use
c. Rigid structures – hard to change, may not be the most up to date either
What is a data lake?
- ) Data Lake
a. Raw and/or unstructured
b. Ready to analyze – more up to date but requires more advanced tools to query
c. Flexible – no structure is enforced
4 Stages of a Data Pipeline
- ) Ingestion
- ) Storage
- ) Processing
a. ETL – Data is extracted from a source and transformed to fit the destination before it is loaded
b. ELT – Data is loaded into a data lake as-is, and transformations can take place later
c. Common transformations: Formatting / Labeling / Filtering / Validating
- ) Visualization
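To make the ETL / ELT difference in the Processing stage concrete, here is a minimal Python sketch. The record fields (`id`, `temp_c`), the function names, and the plain lists standing in for a warehouse and a data lake are all assumptions for illustration; none of this is a GCP API.

```python
from typing import Iterable

# Hypothetical raw records, as they might arrive from the ingestion stage.
RAW = [
    {"id": "1", "temp_c": "21.5"},
    {"id": "2", "temp_c": "not-a-number"},  # will fail validation
    {"id": "3", "temp_c": "-5.0"},
]


def transform(records: Iterable[dict]) -> list[dict]:
    """Apply the common transformations: validate, filter, format, label."""
    out = []
    for rec in records:
        try:
            temp = float(rec["temp_c"])      # validating
        except ValueError:
            continue                          # filtering out invalid rows
        out.append({
            "id": int(rec["id"]),            # formatting (str -> int)
            "temp_c": temp,
            "freezing": temp <= 0.0,         # labeling
        })
    return out


def etl(source: list[dict], warehouse: list[dict]) -> None:
    """ETL: transform first, so only clean rows reach the destination."""
    warehouse.extend(transform(source))


def elt(source: list[dict], lake: list[dict]) -> list[dict]:
    """ELT: load the raw rows untouched, then transform later on demand."""
    lake.extend(source)
    return transform(lake)


warehouse: list[dict] = []
lake: list[dict] = []
etl(RAW, warehouse)
view = elt(RAW, lake)
```

After running this, the warehouse holds only the cleaned rows, while the lake still holds every raw row and the cleaned view is computed afterwards.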
Cloud Storage
o Unstructured object storage
o Regional, dual-region, or multi-region
o Standard, Nearline, or Coldline
o Storage event triggers (Pub/Sub)
(Usually the first step in a cloud data pipeline)
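As a rough illustration of a storage event trigger, the sketch below inspects the attributes of a Cloud Storage Pub/Sub notification and decides whether to start a pipeline. The attribute names `eventType`, `bucketId`, and `objectId` and the `OBJECT_FINALIZE` event type come from the Cloud Storage notification format; the `incoming/` prefix, the function name, and the sample values are hypothetical.

```python
from typing import Optional

# Assumption: only finished uploads under this prefix should start the pipeline.
WATCHED_PREFIX = "incoming/"


def object_to_process(attributes: dict) -> Optional[str]:
    """Return a gs:// URI if this notification should start the pipeline."""
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None  # ignore deletes, archives, and metadata updates
    name = attributes.get("objectId", "")
    if not name.startswith(WATCHED_PREFIX):
        return None  # object is outside the watched prefix
    return f"gs://{attributes['bucketId']}/{name}"


# Hypothetical notification attributes for a newly uploaded file.
example = {
    "eventType": "OBJECT_FINALIZE",
    "bucketId": "example-bucket",
    "objectId": "incoming/2024-01-01.csv",
}
uri = object_to_process(example)
```

In a real subscriber these attributes would be read from the received Pub/Sub message; here they are a plain dict so the logic can be checked locally.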
Cloud Bigtable
o Petabyte-scale NoSQL database
o High-throughput and scalability
o Wide column key/value data
o Time-series, transactional, IoT data
Cloud BigQuery
o Petabyte-scale analytics DW
o Fast SQL queries across large datasets
o Foundations for BI and AI
o Useful public datasets
Cloud Spanner
o Global SQL-based relational database
o Horizontal scalability and HA
o Strong consistency
o Not cheap to run, usually used for financial transactions
Cloud SQL
o Managed MySQL, PostgreSQL, and SQL Server instances
o Built-in backups, replicas, and failover
o Does not scale horizontally, but does scale vertically
Cloud Firestore
o Fully managed NoSQL document database
o Large collections of small JSON documents
o Realtime database with mobile SDKs
o Strong consistency
Cloud Memorystore
o Managed Redis instances
o In-memory DB, cache, or message broker
o Built-in HA
o Scales vertically by increasing the amount of RAM
Cloud Storage (GCS) at a high level
o Fully managed object storage
o For unstructured data: Images, videos, etc.
o Access via API or programmatic SDK
o Multiple storage classes
o Instant access in all classes, also has lifecycle management
o Secure and durable (HA and maximum durability)
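Lifecycle management is configured as a list of rules, each pairing an action with a condition. The sketch below builds such a configuration in Python and serializes it to JSON. The `rule` / `action` / `condition` layout and the `SetStorageClass` and `Delete` action types follow the documented lifecycle configuration shape; the specific age thresholds and the function name are assumptions chosen for illustration.

```python
import json


def build_lifecycle(nearline_after: int = 30,
                    coldline_after: int = 90,
                    delete_after: int = 365) -> dict:
    """Build a lifecycle config that steps objects down to cheaper classes.

    The day thresholds are illustrative assumptions, not recommendations.
    """
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after},
            },
            {
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                "condition": {"age": coldline_after},
            },
            {
                "action": {"type": "Delete"},
                "condition": {"age": delete_after},
            },
        ]
    }


config = build_lifecycle()
config_json = json.dumps(config, indent=2)  # text that could be saved to a file
```

The `age` condition is measured in days since the object was created, so this configuration moves data to colder classes as it ages and eventually deletes it.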
GCS Concepts, what is GCS, where can buckets be?
o A bucket is a logical container for objects
o Buckets exist within projects (named within a global namespace)
o Buckets can be:
o Regional $
o Dual-regional $$ (HA)
o Multi-regional $$$ (Replicated across multiple regions within a large geographic area; lowest latency for widely distributed users)
4 GCS Storage Classes
o Standard
o Nearline
o Coldline
o Archive
Describe standard GCS Storage Class
$0.02 per GB
99.99% regional availability
>99.99% availability in multi and dual-regions
Describe Nearline GCS Storage Class
30 day minimum storage
$0.01 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Coldline GCS Storage Class
90 day minimum storage
$0.004 per GB stored
$0.02 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
Describe Archive GCS Storage Class
365 day minimum storage
$0.0012 per GB stored
$0.05 per GB up/down
99.9% regional availability
99.95% availability in multi and dual regions
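The trade-off between the four classes is easier to see with some arithmetic. The sketch below estimates a bill from the per-GB prices and minimum storage durations on these cards. Several things are assumptions: the Nearline stored price of $0.01 per GB (the card only lists one Nearline figure), treating "up/down" as a per-GB retrieval fee, treating the stored price as per 30-day month, and billing objects deleted early as if they had been kept for the full minimum duration. Real pricing varies by location and changes over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    stored_per_gb_month: float  # USD per GB per (assumed 30-day) month
    retrieval_per_gb: float     # USD per GB read back
    min_days: int               # minimum storage duration in days


# Prices taken from the cards; the Nearline stored price is an assumption.
CLASSES = {
    "standard": StorageClass(0.02, 0.0, 0),
    "nearline": StorageClass(0.01, 0.01, 30),
    "coldline": StorageClass(0.004, 0.02, 90),
    "archive": StorageClass(0.0012, 0.05, 365),
}


def estimate_cost(class_name: str, gb: float, days_stored: int,
                  gb_retrieved: float = 0.0) -> float:
    """Estimate the USD cost of storing and retrieving data in one class."""
    cls = CLASSES[class_name]
    # Early deletion is billed up to the minimum storage duration.
    billed_days = max(days_stored, cls.min_days)
    storage = gb * cls.stored_per_gb_month * billed_days / 30
    retrieval = gb_retrieved * cls.retrieval_per_gb
    return storage + retrieval


# Compare 1 TB kept for 10 days and read back once, in each class.
for name in CLASSES:
    print(f"{name:9s} ${estimate_cost(name, 1000, 10, gb_retrieved=1000):.2f}")
```

Under these assumptions, short-lived or frequently read data is cheapest in Standard, because the colder classes add retrieval fees and minimum-duration charges.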
Objects in cloud storage (encryption, changes)
o Encrypted in flight and at rest
o Objects are immutable; to change one you must overwrite it (an atomic operation)
o Objects can be versioned
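The immutability and versioning behavior can be pictured with a toy in-memory model. This is not the Cloud Storage API: the `ToyBucket` class, its methods, and the simple counter used for generation numbers are all hypothetical, and only mirror the idea that an overwrite replaces the whole object and, with versioning on, keeps the previous copy.

```python
from collections import defaultdict
from typing import Optional


class ToyBucket:
    """In-memory stand-in for a bucket with optional object versioning."""

    def __init__(self, versioning: bool = False) -> None:
        self.versioning = versioning
        self._live: dict = {}                 # name -> (generation, data)
        self._archived = defaultdict(list)    # name -> older (generation, data)
        self._next_generation = 1

    def write(self, name: str, data: bytes) -> int:
        """Replace the whole object; there is no in-place edit."""
        if name in self._live and self.versioning:
            self._archived[name].append(self._live[name])
        generation = self._next_generation
        self._next_generation += 1
        self._live[name] = (generation, bytes(data))
        return generation

    def read(self, name: str, generation: Optional[int] = None) -> bytes:
        """Read the live object, or a specific older generation if kept."""
        if generation is None:
            return self._live[name][1]
        for gen, data in [self._live[name], *self._archived[name]]:
            if gen == generation:
                return data
        raise KeyError(f"{name}#{generation} not found")


bucket = ToyBucket(versioning=True)
first = bucket.write("report.csv", b"v1")
second = bucket.write("report.csv", b"v2")  # overwrite creates a new generation
```

With versioning enabled, `bucket.read("report.csv", first)` still returns the original bytes; with versioning disabled, the overwritten copy is gone.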
Name the 5 “advanced” features of GCS
o Parallel uploads of a single object
o Integrity checking – pre-calculate an MD5 hash and compare it to the one Google calculates
o Transcoding for compression
o Requester can pay, if desired
o Pub/Sub notifications
(New file commits can be used to trigger a data pipeline)
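The integrity-checking feature in this list can be sketched with the standard library alone. Cloud Storage reports an object's MD5 as a base64-encoded digest, so the local hash is encoded the same way before comparing. The function names and the sample payload are hypothetical, and the "reported" value here is simulated locally rather than fetched from a bucket.

```python
import base64
import hashlib


def base64_md5(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded as in object metadata."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def upload_is_intact(local_data: bytes, reported_md5: str) -> bool:
    """Compare the locally pre-calculated hash with the service-side hash."""
    return base64_md5(local_data) == reported_md5


payload = b"id,temp_c\n1,21.5\n"
expected = base64_md5(payload)                 # pre-calculated before upload
ok = upload_is_intact(payload, expected)       # True when bytes match
corrupted = upload_is_intact(payload + b"x", expected)  # False on any change
```

A mismatch means the stored bytes differ from what was sent, so the upload should be retried.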
What is Cloud Transfer Service?
o Transfers from a source to a sink (bucket); supported sources: S3, HTTP, Cloud Storage
o Transfers can be filtered based on names/dates
o Can be scheduled for a single run or periodically (can delete in source or destination after transfer is confirmed)
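Filtering a transfer by names and dates amounts to a predicate over the source listing. The sketch below shows that idea in plain Python. The `SourceObject` type, the `select_for_transfer` function, and its parameter names are hypothetical and are not the Transfer Service API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class SourceObject:
    name: str
    last_modified: datetime


def select_for_transfer(objects: Iterable[SourceObject],
                        include_prefix: str = "",
                        exclude_prefix: Optional[str] = None,
                        modified_since: Optional[datetime] = None) -> list:
    """Keep only the objects that pass the name and date filters."""
    selected = []
    for obj in objects:
        if not obj.name.startswith(include_prefix):
            continue  # name filter: outside the included prefix
        if exclude_prefix and obj.name.startswith(exclude_prefix):
            continue  # name filter: explicitly excluded
        if modified_since and obj.last_modified < modified_since:
            continue  # date filter: too old
        selected.append(obj)
    return selected


listing = [
    SourceObject("logs/2023/app.log", datetime(2023, 6, 1, tzinfo=timezone.utc)),
    SourceObject("logs/2024/app.log", datetime(2024, 2, 1, tzinfo=timezone.utc)),
    SourceObject("tmp/scratch.bin", datetime(2024, 2, 1, tzinfo=timezone.utc)),
]
chosen = select_for_transfer(
    listing,
    include_prefix="logs/",
    modified_since=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Only `logs/2024/app.log` passes both filters in this example.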
What is BigQuery Data Transfer Service?
o Automates data transfer to BigQuery
o Data is loaded on a regular basis
o Backfill can recover from gaps or outages
o Supported sources: Cloud Storage, Merchant Center, Google Play, S3, Teradata, Redshift
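Backfill is essentially "find the scheduled runs that did not load, then rerun them". A minimal sketch of the gap-finding step, assuming one load per day; the function name and the sample dates are hypothetical and this is not the BigQuery Data Transfer Service API.

```python
from datetime import date, timedelta


def missing_days(loaded: set, start: date, end: date) -> list:
    """Return each day in [start, end] that has no successful load."""
    span = (end - start).days
    all_days = (start + timedelta(days=i) for i in range(span + 1))
    return [day for day in all_days if day not in loaded]


# Hypothetical load history with a two-day outage.
loaded_days = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)}
gaps = missing_days(loaded_days, date(2024, 1, 1), date(2024, 1, 5))
```

Here `gaps` contains 3 and 4 January 2024, which are the days a backfill run would need to cover.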
What is a transfer appliance?
Physical rack-mounted storage device; 100 TB and 480 TB versions
What are the top 3 features of Cloud SQL?
o Managed SQL instances (creation, replication, backups, patches, updates)
o Multiple DB engines (MySQL, PostgreSQL, SQL Server)
o Scalability – vertically to 64 cores and 416 GB of RAM, HA options are available