Big Data Flashcards by Расана Крестелева

What defines Big Data (3V)

Volume, Velocity, Variety

What is Volume

The scale of information being handled by a data processing system

What is Velocity

The speed at which data is being processed: ingested, analyzed, and visualized

What is Variety

The diversity of data sources, formats, and quality

Data Warehouses

Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
Rigid: Data may be easier to understand, but is less up to date. Structures are hard to change

Data Lakes

Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
Flexible: No structure is enforced, so new types of data can be added at any time

OLTP

High volume of short transactions
Fast queries
High integrity

MODIFY DATA

OLAP

Low volume of long-running queries
Aggregated historical data

QUERY DATA
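
The contrast between the OLTP and OLAP cards can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the `orders` table, its columns, and the `place_order` helper are hypothetical, and a real OLAP workload would normally run on a separate analytical store rather than the transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

def place_order(customer, amount):
    # OLTP style: one short transaction that modifies data.
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )

for customer, amount in [("ann", 10.0), ("bob", 25.0), ("ann", 5.0)]:
    place_order(customer, amount)

# OLAP style: a read-only query that aggregates historical data.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```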

Stages of a Data Pipeline

Ingestion
Storage
Processing
Visualization

Data ingestion Technical Challenges

Choose the correct compute and storage options; otherwise, a solution can be too expensive or too slow
Data should have value
Data must be kept secure

Common data transformations

formatting
labeling
filtering
validating
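
A minimal sketch of the four transformations applied to a small batch of records. The record fields (`user`, `amount`, `ts`), the input date format, and the "large"/"small" threshold are all assumptions made for illustration, not part of the deck.

```python
from datetime import datetime

def format_record(raw):
    # Formatting: normalize types and string layout.
    return {
        "user": raw["user"].strip().lower(),
        "amount": float(raw["amount"]),
        "ts": datetime.strptime(raw["ts"], "%d/%m/%Y").date().isoformat(),
    }

def label_record(rec):
    # Labeling: attach a derived category (the threshold is arbitrary).
    return {**rec, "size": "large" if rec["amount"] >= 100 else "small"}

def is_valid(rec):
    # Validating: reject records that break basic expectations.
    return bool(rec["user"]) and rec["amount"] >= 0

def transform(raw_records, min_amount=1.0):
    out = []
    for raw in raw_records:
        rec = label_record(format_record(raw))
        if not is_valid(rec):
            continue
        if rec["amount"] < min_amount:
            # Filtering: drop records that are not relevant downstream.
            continue
        out.append(rec)
    return out

raw = [
    {"user": " Ann ", "amount": "120.5", "ts": "03/02/2024"},
    {"user": "bob", "amount": "0.5", "ts": "04/02/2024"},
    {"user": "", "amount": "10", "ts": "05/02/2024"},
]
clean = transform(raw)
print(clean)
```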

Stages of Data Modeling

Conceptual. What are the entities in my data? What are their attributes and relationships?
Logical
Physical

Google Cloud Storage (GCS)

Fully managed object storage
For unstructured data: images, videos. Access via API or programmatic SDKs
Multiple storage classes
Instant access in all classes. Lifecycle management for objects and buckets
Secure and durable
Secure access control. High availability and maximum durability

Google Cloud Storage concepts (buckets)

A bucket is a logical container for objects
Buckets exist within projects
Bucket names exist within a global namespace
A bucket can be:
- regional
- dual-regional
- multi-regional

Storage classes in GCS

Standard
Nearline
Coldline
Archive

Standard storage class in GCS

minimum storage: -
storage fee (per GB): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%

Nearline storage class in GCS

minimum storage: 30 days
storage fee (per GB): $0.01
retrieval fee: $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%

Coldline storage class in GCS

minimum storage: 90 days
storage fee (per GB): $0.004
retrieval fee: $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%

Archive storage class in GCS

minimum storage: 365 days
storage fee (per GB): $0.0012
retrieval fee: $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
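
The trade-off between the four classes can be made concrete with a rough cost estimate. The sketch below uses the figures from the cards and assumes that storage fees are per GB per 30-day month, that retrieval fees are per GB read, that a dash on the Standard card means zero, and that objects deleted early are still billed for the minimum storage duration. The function name and the simplified billing model are illustrative; actual pricing varies by location and changes over time.

```python
# Prices copied from the cards above; units are assumptions (see lead-in).
CLASSES = {
    "standard": {"min_days": 0, "storage": 0.02, "retrieval": 0.0},
    "nearline": {"min_days": 30, "storage": 0.01, "retrieval": 0.01},
    "coldline": {"min_days": 90, "storage": 0.004, "retrieval": 0.02},
    "archive": {"min_days": 365, "storage": 0.0012, "retrieval": 0.05},
}

def estimate_cost(storage_class, gb, days_stored, gb_retrieved=0.0):
    """Rough cost in dollars for storing `gb` for `days_stored` days."""
    c = CLASSES[storage_class]
    # Early deletion is billed as if the object stayed for the minimum duration.
    billed_days = max(days_stored, c["min_days"])
    storage = gb * c["storage"] * billed_days / 30
    retrieval = gb_retrieved * c["retrieval"]
    return round(storage + retrieval, 4)

# 100 GB kept for one month and read back once in full:
for name in CLASSES:
    print(name, estimate_cost(name, gb=100, days_stored=30, gb_retrieved=100))
```

Under these assumptions, data that is deleted after a month or read frequently is cheapest in Standard, while the colder classes only pay off for data that stays put and is rarely read.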

Objects in Google Cloud Storage

Objects are stored as opaque data
Objects are immutable
Overwrites are atomic
Objects can be versioned (optionally)

Accessing Buckets and Objects

Google Cloud Console
HTTP API
SDKs
gsutil (command line tool)

Advanced features of Google Cloud Storage

Parallel uploads of composite objects
Integrity checking
Transcoding
Requester Pays
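
As an illustration of integrity checking: Cloud Storage object metadata carries checksums (CRC32C and, for non-composite objects, MD5) encoded as base64 strings, so a client can compare a locally computed hash against the stored one. The sketch below uses only the standard library and covers the MD5 case; the helper names are hypothetical, and fetching the stored hash from the service is left out.

```python
import base64
import hashlib

def md5_base64(data: bytes) -> str:
    # Hash the bytes, then base64-encode the raw 16-byte digest
    # (not the hex string).
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def verify(data: bytes, expected_md5_b64: str) -> bool:
    # True when the local content matches the checksum reported for the object.
    return md5_base64(data) == expected_md5_b64

payload = b"hello, bucket"
stored_hash = md5_base64(payload)  # stands in for the hash read from metadata
print(verify(payload, stored_hash))
print(verify(payload + b"!", stored_hash))
```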

Google Cloud Storage Costs

operation charges
network charges
data retrieval charges

Google Cloud Storage Lifecycle management

Apply a lifecycle configuration to a bucket
GCS periodically checks the configuration
Matching rules are applied to objects
Rules delete objects or set storage classes

The lifecycle management configuration file is a JSON file
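
A minimal sketch of such a configuration, built as a Python dictionary and serialized to JSON. The two rules (move Standard objects to Nearline after 30 days, delete everything after 365 days) are arbitrary examples, not recommendations.

```python
import json

lifecycle = {
    "rule": [
        {
            # Move objects to a colder class once they are 30 days old.
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]},
        },
        {
            # Delete objects once they are a year old.
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

config_json = json.dumps(lifecycle, indent=2)
print(config_json)
```

Saved to a file such as `lifecycle.json` (a hypothetical name), a configuration like this can be applied with `gsutil lifecycle set lifecycle.json gs://BUCKET_NAME`, where `BUCKET_NAME` is a placeholder.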

Security and Access Control in GCS

- IAM for bulk access to buckets
- ACLs (access control lists) for granular access to buckets
- Signed URLs for temporary access
- Signed policy documents

Amazon analog of Google Cloud SQL

Amazon RDS

Google Cloud SQL

- Managed SQL instances: automated instance and database creation, replication, backups, patches, and updates
- Multiple database engines: MySQL 5.6 and 5.7, PostgreSQL 9.6 or 11, SQL Server in beta
- Scalability and availability: vertically scale to 64 cores and 416 GB RAM. Live migration and less configuration

Google Cloud Firestore replaces...

Cloud Datastore

Amazon analog of Google Cloud Firestore

Amazon DynamoDB

Google Cloud Firestore

- Fully managed NoSQL database: serverless, autoscaling NoSQL document store. Integrated with GCP and Firebase
- Realtime DB with mobile SDK: Android and iOS client libraries, frameworks for all popular programming languages
- Scalability and consistency: horizontal autoscaling and strong consistency, with support for ACID transactions

Firestore Data Model

- It's a document store
- A document is just some JSON data
- Documents are bundled together into a collection
- Documents can contain nested sub-collections
- References
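
The model above can be mimicked with plain Python dictionaries to show how collections, documents, sub-collections, and references fit together. This is a toy in-memory illustration, not the Firestore client API: the `fields`/`collections` layout, the sample data, and the `get_document` helper are all hypothetical.

```python
# collection -> document id -> {"fields": ..., "collections": ...}
db = {
    "users": {
        "ann": {
            "fields": {"name": "Ann", "age": 31},
            "collections": {
                "orders": {  # sub-collection nested under the document
                    "o1": {
                        # "buyer" holds a reference: the path of another document
                        "fields": {"total": 25.0, "buyer": "users/ann"},
                        "collections": {},
                    }
                }
            },
        }
    }
}

def get_document(root, path):
    """Return the fields of the document at 'collection/doc[/collection/doc...]'."""
    parts = path.split("/")
    if len(parts) % 2 != 0:
        # Paths alternate collection and document, so documents sit at even length.
        raise ValueError("a document path has an even number of segments")
    collections, doc = root, None
    for i in range(0, len(parts), 2):
        doc = collections[parts[i]][parts[i + 1]]
        collections = doc["collections"]
    return doc["fields"]

order = get_document(db, "users/ann/orders/o1")
buyer = get_document(db, order["buyer"])  # follow the reference
print(order["total"], buyer["name"])
```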

Firestore supported datatypes

- String, integer, boolean, float, null
- Bytes, date and time, geographical point
- Array and map
- Reference (to document)

Indexes in Cloud Firestore

- Automatic single-field indexes
- Index exemption
- Composite indexes

Google Cloud Spanner

- Managed SQL-compliant DB: SQL (ANSI 2011) schemas and queries with ACID transactions
- Horizontally scalable: strong consistency across rows and regions, from 1 to thousands of nodes
- Highly available: automatic global replication, no planned downtime, and 99.999% SLA

CAP theorem

A distributed system can guarantee at most 2 of 3:
- Consistency
- Availability
- Partition tolerance

Google Cloud Spanner is (CAP)

CP system. Sometimes it sacrifices availability for consistency

Big Data Flashcards

(36 cards)