Big Data Flashcards

(36 cards)

1
Q

What defines Big Data (the 3 Vs)?

A

Volume, Velocity, Variety

2
Q

What is Volume?

A

The scale of information handled by a data processing system

3
Q

What is Velocity?

A

The speed at which data is being processed: ingested, analyzed, and visualized

4
Q

What is Variety?

A

The diversity of data sources, formats, and quality

5
Q

Data Warehouses

A
  1. Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
  2. Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
  3. Rigid: Data may be easier to understand, but less up-to-date. Structures are hard to change
6
Q

Data Lakes

A
  1. Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
  2. Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
  3. Flexible: No structure is enforced, so new types of data can be added at any time
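
The "rigid" versus "flexible" contrast between cards 5 and 6 can be sketched in a few lines of Python. This is only a toy model: `WAREHOUSE_COLUMNS`, the two loader functions, and the sample record are hypothetical names invented for illustration, and plain lists stand in for the real storage systems.

```python
WAREHOUSE_COLUMNS = ("user_id", "amount")  # hypothetical fixed schema

def load_into_warehouse(table, record):
    """Schema enforced on write: reject records that do not match the columns."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"record does not match schema: {sorted(record)}")
    table.append(tuple(record[c] for c in WAREHOUSE_COLUMNS))

def load_into_lake(lake, record):
    """No schema enforced: store the raw record as-is."""
    lake.append(record)

warehouse, lake = [], []
new_record = {"user_id": 1, "amount": 9.5, "coupon": "SPRING"}  # has an extra field

load_into_lake(lake, new_record)  # accepted without any change
try:
    load_into_warehouse(warehouse, new_record)
except ValueError as err:
    print("warehouse rejected:", err)

# The same data fits the warehouse only after it is shaped to the schema.
load_into_warehouse(warehouse, {"user_id": 1, "amount": 9.5})
print(len(lake), len(warehouse))
```

The lake accepts the new field immediately, while the warehouse needs either a schema change or a transformation step first.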
7
Q

OLTP

A
  • High volume of short transactions
  • Fast queries
  • High integrity

MODIFY DATA

8
Q

OLAP

A
  • Low volume of long-running queries
  • Aggregated historical data

QUERY DATA
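
The difference between cards 7 and 8 can be illustrated with Python's built-in `sqlite3` module. The `orders` table, its columns, and the sample rows are hypothetical. SQLite is used only because it needs no setup; it is not a claim about which engines are used for OLTP or OLAP in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")

# OLTP style: many short transactions that modify data.
for customer, amount in [("ann", 10.0), ("bob", 25.0), ("ann", 5.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)",
            (customer, amount),
        )

# OLAP style: one read-only query that aggregates historical data.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```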

9
Q

Stages of a Data Pipeline

A
  1. Ingestion
  2. Storage
  3. Processing
  4. Visualization
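
The four stages can be sketched as four small functions chained together. Everything here is an assumption made for illustration: the function names, the `sensor`/`value` record shape, and the use of a Python list as the storage layer.

```python
import json

def ingest(raw_lines):
    """Ingestion: parse incoming JSON lines into records."""
    return [json.loads(line) for line in raw_lines]

def store(records, storage):
    """Storage: append records to a list standing in for a bucket or table."""
    storage.extend(records)
    return storage

def process(storage):
    """Processing: sum the stored values per sensor."""
    totals = {}
    for rec in storage:
        totals[rec["sensor"]] = totals.get(rec["sensor"], 0) + rec["value"]
    return totals

def visualize(totals):
    """Visualization: render a plain-text bar chart, one line per sensor."""
    return "\n".join(f"{name}: {'#' * count}" for name, count in sorted(totals.items()))

raw = [
    '{"sensor": "a", "value": 2}',
    '{"sensor": "b", "value": 1}',
    '{"sensor": "a", "value": 1}',
]
chart = visualize(process(store(ingest(raw), [])))
print(chart)
```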
10
Q

Data ingestion Technical Challenges

A
  • choose the correct compute and storage options. Otherwise, a solution can be too expensive or too slow
  • data should have value
  • security of data
11
Q

Common data transformations

A
  • formatting
  • labeling
  • filtering
  • validating
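
A minimal sketch of the four transformations applied to a few records. The field names, the validation rules, and the `age_group` label are hypothetical choices, not part of the card.

```python
from datetime import datetime

raw = [
    {"name": " alice ", "signup": "2021-03-01", "age": "34"},
    {"name": "bob", "signup": "2021-13-01", "age": "29"},    # invalid month
    {"name": "Carol", "signup": "2021-04-15", "age": "-5"},  # invalid age
]

def is_valid(rec):
    """Validating: the date must parse and the age must be a non-negative integer."""
    try:
        datetime.strptime(rec["signup"], "%Y-%m-%d")
        return int(rec["age"]) >= 0
    except ValueError:
        return False

def transform(rec):
    """Formatting and labeling for a record that already passed validation."""
    age = int(rec["age"])
    return {
        "name": rec["name"].strip().title(),               # formatting
        "signup": rec["signup"],
        "age": age,                                        # formatting: text to int
        "age_group": "30+" if age >= 30 else "under 30",   # labeling
    }

# Filtering: keep only the records that pass validation.
clean = [transform(r) for r in raw if is_valid(r)]
print(clean)
```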
12
Q

Stages of Data Modeling

A
  1. Conceptual. What are the entities in my data? What are their attributes and relationships?
  2. Logical
  3. Physical
13
Q

Google Cloud Storage (GCS)

A
  • Fully managed object storage
    For unstructured data: images, videos. Access via API or programmatic SDKs
  • Multiple storage classes
    Instant access in all classes. Lifecycle management for objects and buckets
  • Secure and durable
    Secure access control. High availability and maximum durability
14
Q

Google Cloud Storage concepts (buckets)

A
  • a bucket is a logical container for objects
  • buckets exist within projects
  • bucket names exist within a global namespace
  • a bucket can be:
    - regional
    - dual-regional
    - multi-regional
15
Q

Storage classes in GCS

A
  • Standard
  • Nearline
  • Coldline
  • Archive
16
Q

Standard storage class in GCS

A

minimum storage: -
storage fee (per GB): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%

17
Q

Nearline storage class in GCS

A

minimum storage: 30 days
storage fee (per GB): $0.01
retrieval fee: $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%

18
Q

Coldline storage class in GCS

A

minimum storage: 90 days
storage fee (per GB): $0.004
retrieval fee: $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%

19
Q

Archive storage class in GCS

A

minimum storage: 365 days
storage fee (per GB): $0.0012
retrieval fee: $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
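
The trade-off across cards 16 to 19 (cheaper storage, more expensive retrieval) can be made concrete with a small calculation. The fees below are copied from the cards and should be treated as study values, not current prices. The sketch assumes the storage fee is per GB per month and the retrieval fee is per GB retrieved. It ignores minimum storage duration, operation charges, and network charges. The function names are hypothetical.

```python
# Fees copied from cards 16-19 (USD). Assumption: storage is per GB per month,
# retrieval is per GB retrieved.
STORAGE_CLASSES = {
    "standard": {"storage": 0.02, "retrieval": 0.0},
    "nearline": {"storage": 0.01, "retrieval": 0.01},
    "coldline": {"storage": 0.004, "retrieval": 0.02},
    "archive": {"storage": 0.0012, "retrieval": 0.05},
}

def monthly_cost(storage_class, gb_stored, gb_retrieved):
    """Storage plus retrieval cost for one month under the assumptions above."""
    fees = STORAGE_CLASSES[storage_class]
    return gb_stored * fees["storage"] + gb_retrieved * fees["retrieval"]

def cheapest_class(gb_stored, gb_retrieved):
    """Return the class with the lowest monthly cost for this access pattern."""
    return min(STORAGE_CLASSES, key=lambda c: monthly_cost(c, gb_stored, gb_retrieved))

# 1000 GB stored, with increasing amounts read back each month.
for retrieved in (0, 100, 2000):
    print(retrieved, cheapest_class(1000, retrieved))
```

Under these figures, data that is never read is cheapest in Archive, while data that is read back in full more than once a month is cheapest in Standard.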

20
Q

Objects in Google Cloud Storage

A
  • Objects are stored as opaque data
  • Objects are immutable
  • Overwrites are atomic
  • Objects can be versioned (optionally)
21
Q

Accessing Buckets and Objects

A
  • Google Cloud Console
  • HTTP API
  • SDKs
  • gsutil (command line tool)
22
Q

Advanced features of Google Cloud Storage

A
  • Parallel uploads of composite objects
  • Integrity checking
  • Transcoding
  • Requester Pays
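
Integrity checking comes down to comparing a hash computed locally with the hash the service reports for the stored object. Cloud Storage object metadata includes a base64-encoded MD5 hash (alongside CRC32C), so the sketch below uses that form. The helper names are hypothetical, and the "reported" hash is computed locally as a stand-in because no real upload takes place.

```python
import base64
import hashlib

def md5_base64(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def verify(data: bytes, expected_md5: str) -> bool:
    """Compare a locally computed hash with the one reported for the object."""
    return md5_base64(data) == expected_md5

payload = b"hello world"
reported = md5_base64(payload)  # stand-in for the hash returned with an upload

print(verify(payload, reported))         # data unchanged
print(verify(payload + b"!", reported))  # data corrupted in transit
```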
23
Q

Google Cloud Storage Costs

A
  • operation charges
  • network charges
  • data retrieval charges
24
Q

Google Cloud Storage Lifecycle management

A
  • apply a lifecycle configuration to a bucket
  • GCS periodically checks the configuration
  • matching rules are applied to objects
  • delete objects or set storage classes

The lifecycle management configuration file is a JSON file
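
As an illustration of the JSON configuration file mentioned in this card, the sketch below builds one in Python. The policy itself is hypothetical (move objects to Nearline after 30 days, delete them after 365). The `rule`, `action`, and `condition` keys follow the format documented for Cloud Storage lifecycle configurations.

```python
import json

# Hypothetical policy: move objects to Nearline after 30 days, delete after 365.
lifecycle = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},  # age in days since object creation
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

config_text = json.dumps(lifecycle, indent=2)
print(config_text)
```

Saved as `lifecycle.json`, a file like this can be applied with `gsutil lifecycle set lifecycle.json gs://BUCKET_NAME`, where `BUCKET_NAME` is a placeholder.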

25
Q

Security and Access Control in GCS

A
  • IAM for bulk access to buckets
  • ACLs (access control lists) for granular access to buckets
  • Signed URLs for temporary access
  • Signed policy documents
26
Q

Amazon analog of Google Cloud SQL

A

Amazon RDS
27
Q

Google Cloud SQL

A
  • Managed SQL instances
    Automate instance and database creation, replication, backups, patches and updates
  • Multiple database engines
    MySQL 5.6 and 5.7, PostgreSQL 9.6 or 11, SQL Server in beta
  • Scalability and availability
    Vertically scale to 64 cores and 416 GB RAM. Live migration and less configuration
28
Q

Google Cloud Firestore replaces...

A

Cloud Datastore
29
Q

Amazon analog of Google Cloud Firestore

A

Amazon DynamoDB
30
Q

Google Cloud Firestore

A
  • Fully managed NoSQL database
    Serverless autoscaling NoSQL document store. Integrated with GCP and Firebase
  • Realtime DB with mobile SDK
    Android and iOS client libraries, frameworks for all popular programming languages
  • Scalability and consistency
    Horizontal autoscaling and strong consistency, with support for ACID transactions
31
Q

Firestore Data Model

A
  • it's a document store
  • a document is just some JSON data
  • documents are bundled together into a collection
  • documents can contain nested sub-collections
  • references
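
The nesting described in this card shows up in how paths are written: segments alternate between collection and document, so `users/ann/orders` names a sub-collection under the document `users/ann`. The sketch below is plain Python and does not call the Firestore client library. The function name and the sample data are hypothetical.

```python
def path_kind(path: str) -> str:
    """Classify a slash-separated Firestore-style path.

    Segments alternate collection / document / sub-collection / document,
    so an odd number of segments names a collection and an even number
    names a document.
    """
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError("empty path")
    return "collection" if len(segments) % 2 == 1 else "document"

# A document is a set of fields, similar to a JSON object (hypothetical data).
user_doc = {"name": "Ann", "tags": ["admin"], "address": {"city": "Oslo"}}

for p in ("users", "users/ann", "users/ann/orders", "users/ann/orders/o1"):
    print(p, "->", path_kind(p))
```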
32
Q

Firestore supported datatypes

A
  • String, integer, boolean, float, null
  • Bytes, date and time, geographical point
  • Array and map
  • Reference (to document)
33
Q

Indexes in Cloud Firestore

A
  • Automatic single-field indexes
  • Index exemption
  • Composite indexes
34
Q

Google Cloud Spanner

A
  • Managed SQL-compliant DB
    SQL (ANSI 2011) schemas and queries with ACID transactions
  • Horizontally scalable
    Strong consistency across rows and regions, from 1 to 1000s of nodes
  • Highly available
    Automatic global replication, no planned downtime and 99.999% SLA
35
Q

CAP theorem

A

Any 2 of 3:
  • Consistency
  • Availability
  • Partition tolerance
36
Q

Google Cloud Spanner is (CAP)

A

A CP system. It sometimes sacrifices availability for consistency