Big Data Flashcards
(36 cards)
What defines Big Data (3V)
Volume, Velocity, Variety
What is Volume
The scale of information being handled by a data processing system
What is Velocity
The speed at which data is being processed: ingested, analyzed, and visualized
What is Variety
The diversity of data sources, formats, and quality
Data Warehouses
- Structured or Processed: Data is organized, may have been transformed, and is stored in a structured way
- Ready to use: Data exists in the warehouse for a defined purpose, and in a format where it is ready to be consumed
- Rigid: Data may be easier to understand, but less up-to-date. Structures are hard to change
Data Lakes
- Raw or Unstructured: The data lake contains all raw unprocessed data, before any kind of transformation or organization
- Ready to analyze: Data is more up to date, but may require more advanced tools for analysis
- Flexible: No structure is enforced, so new types of data can be added at any time
OLTP
- High volume of short transactions
- Fast queries
- High integrity
MODIFY DATA
OLAP
- Low volume of long-running queries
- Aggregated historical data
QUERY DATA
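A minimal sketch of the OLTP/OLAP contrast, using Python's built-in sqlite3 module purely for illustration. The `orders` table and its columns are hypothetical, and a real OLAP workload would normally run on a separate analytical store rather than the transactional database.

```python
import sqlite3

# In-memory database; the "orders" table is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP style: many short transactions that MODIFY individual rows.
with conn:  # commits on success, rolls back on error
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 20.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 35.0))
    conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 5.0))
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (25.0, 1))

# OLAP style: a read-only QUERY that aggregates over historical rows.
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
print(totals)
```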
Stages of a Data Pipeline
- Ingestion
- Storage
- Processing
- Visualization
Data Ingestion Technical Challenges
- Choosing the correct compute and storage options. Otherwise, a solution can be too expensive or too slow
- Ensuring the ingested data has value
- Keeping the data secure
Common data transformations
- formatting
- labeling
- filtering
- validating
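The four transformations above can be sketched on a toy dataset. The records, field names, and the adult/minor rule are all hypothetical and only illustrate what each step might look like.

```python
from datetime import datetime

# Hypothetical raw input records.
raw_records = [
    {"user": " Alice ", "signup": "2024-01-15", "age": "34"},
    {"user": "bob", "signup": "2024-02-15", "age": "17"},
    {"user": "", "signup": "2024-03-01", "age": "41"},
    {"user": "carol", "signup": "01/04/2024", "age": "29"},
]

def validate(record):
    """Validating: require a non-empty user and an ISO (YYYY-MM-DD) date."""
    if not record["user"].strip():
        return False
    try:
        datetime.strptime(record["signup"], "%Y-%m-%d")
    except ValueError:
        return False
    return True

def format_record(record):
    """Formatting: normalize the user name and convert age to an integer."""
    return {
        "user": record["user"].strip().lower(),
        "signup": record["signup"],
        "age": int(record["age"]),
    }

def label(record):
    """Labeling: attach a segment derived from the age (illustrative rule)."""
    segment = "adult" if record["age"] >= 18 else "minor"
    return {**record, "segment": segment}

valid = [r for r in raw_records if validate(r)]
formatted = [format_record(r) for r in valid]
labeled = [label(r) for r in formatted]
# Filtering: keep only the records of interest.
adults = [r for r in labeled if r["segment"] == "adult"]
print(adults)
```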
Stages of Data Modeling
- Conceptual. What are the entities in my data? What are their attributes and relationships?
- Logical
- Physical
Google Cloud Storage (GCS)
- Fully managed object storage: For unstructured data such as images and videos. Access via API or programmatic SDKs
- Multiple storage classes: Instant access in all classes. Lifecycle management for objects and buckets
- Secure and durable: Secure access control. High availability and maximum durability
Google Cloud Storage concepts (buckets)
- a bucket is a logical container for objects
- buckets exist within projects
- bucket names exist within a global namespace
- a bucket can be regional, dual-regional, or multi-regional
Storage classes in GCS
- Standard
- Nearline
- Coldline
- Archive
Standard storage class in GCS
minimum storage: -
storage fee (per GB): $0.02
retrieval fee: -
regional availability: 99.99%
multi and dual reg.: > 99.99%
Nearline storage class in GCS
minimum storage: 30 days
storage fee (per GB): $0.01
retrieval fee: $0.01
regional availability: 99.9%
multi and dual reg.: 99.95%
Coldline storage class in GCS
minimum storage: 90 days
storage fee (per GB): $0.004
retrieval fee: $0.02
regional availability: 99.9%
multi and dual reg.: 99.95%
Archive storage class in GCS
minimum storage: 365 days
storage fee (per GB): $0.0012
retrieval fee: $0.05
regional availability: 99.9%
multi and dual reg.: 99.95%
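The trade-off between the four classes can be sketched with a small cost estimate. The rates are copied from the cards above; treating the storage fee as a monthly per-GB rate and the retrieval fee as a per-GB rate is an assumption, and the sketch ignores minimum storage duration, operation charges, and network charges. The function names are hypothetical.

```python
# Rates from the cards above (assumed: storage per GB per month, retrieval per GB).
STORAGE_CLASSES = {
    "standard": {"storage_per_gb": 0.02, "retrieval_per_gb": 0.0, "min_days": 0},
    "nearline": {"storage_per_gb": 0.01, "retrieval_per_gb": 0.01, "min_days": 30},
    "coldline": {"storage_per_gb": 0.004, "retrieval_per_gb": 0.02, "min_days": 90},
    "archive": {"storage_per_gb": 0.0012, "retrieval_per_gb": 0.05, "min_days": 365},
}

def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Estimated monthly cost: storage fee plus retrieval fee only."""
    rates = STORAGE_CLASSES[storage_class]
    return (
        stored_gb * rates["storage_per_gb"]
        + retrieved_gb * rates["retrieval_per_gb"]
    )

def cheapest_class(stored_gb, retrieved_gb):
    """Return the class with the lowest estimated monthly cost."""
    return min(
        STORAGE_CLASSES,
        key=lambda name: monthly_cost(name, stored_gb, retrieved_gb),
    )

# Rarely read data favors colder classes; frequently read data favors Standard.
for retrieved in (0, 100, 2000):
    print(retrieved, cheapest_class(1000, retrieved))
```

Under these assumptions, the more data is read back each month, the sooner the retrieval fee outweighs the lower storage fee of the colder classes.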
Objects in Google Cloud Storage
- Objects are stored as opaque data
- Objects are immutable
- Overwrites are atomic
- Objects can be versioned (optionally)
Accessing Buckets and Objects
- Google Cloud Console
- HTTP API
- SDKs
- gsutil (command line tool)
Advanced features of Google Cloud Storage
- Parallel uploads of composite objects
- Integrity checking
- Transcoding
- Requester Pays
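As an illustration of integrity checking: Cloud Storage exposes CRC32C and MD5 hashes for objects, with the MD5 value base64-encoded. The sketch below computes a base64 MD5 locally with the standard library and compares it to a reported value. The helper names are hypothetical, and how the reported hash is obtained (for example from object metadata) is left out.

```python
import base64
import hashlib

def md5_base64(data: bytes) -> str:
    """Base64-encoded MD5 digest of the given bytes."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def is_intact(data: bytes, reported_md5: str) -> bool:
    """True if the local bytes match the hash reported for the object."""
    return md5_base64(data) == reported_md5

payload = b"example object contents"
reported = md5_base64(payload)  # stand-in for the hash the service reports
print(is_intact(payload, reported))
print(is_intact(payload + b"corrupted", reported))
```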
Google Cloud Storage Costs
- operation charges
- network charges
- data retrieval charges
Google Cloud Storage Lifecycle management
- apply a lifecycle configuration to a bucket
- GCS periodically checks the configuration
- matching rules are applied to objects
- rules can delete objects or set their storage class
- the lifecycle management configuration file is a JSON file
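A minimal sketch of what such a JSON configuration can look like, built in Python. The two rules (move to Nearline after 30 days, delete after 365 days) are illustrative values, not recommendations. The `matching_actions` helper is a hypothetical toy model that matches on age only; the real service supports more conditions and applies its own precedence between actions.

```python
import json

# Lifecycle configuration: a list of rules, each with an action and a condition.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30},  # days since object creation
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

def matching_actions(config, object_age_days):
    """Toy model: list the action types whose age condition an object meets."""
    return [
        rule["action"]["type"]
        for rule in config["rule"]
        if object_age_days >= rule["condition"]["age"]
    ]

# The JSON text that would be saved to a file and applied to a bucket.
print(json.dumps(lifecycle_config, indent=2))
print(matching_actions(lifecycle_config, 40))
```

A file with this content can be applied with gsutil, for example `gsutil lifecycle set config.json gs://BUCKET_NAME`, where the file name and bucket name are placeholders.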