Big Data Systems Flashcards
(18 cards)
What are big data systems?
Big data systems are designed to handle and process large and complex data sets, encompassing structured, semi-structured, and unstructured data.
What are the 3 Vs of big data?
Volume, Velocity, Variety
What is Volume in terms of data?
Volume is the enormous amount of data that is available for collection, produced by a variety of sources.
What is velocity?
The speed at which data is generated, today often in real time or near real time.
What is variety?
The mix of data types you are processing: how much of it is structured, semi-structured, or unstructured?
What is veracity?
The trustworthiness of your data
What is variability?
How often does the meaning of the collected data, or the method used to collect it, change?
What is value?
What is the business value of the data you collect?
What is the difference between structured, semi-structured and unstructured data?
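Structured data fits a fixed schema, like data in spreadsheets or relational databases. Unstructured data has no predefined organization, like text, images, audio, and video. Semi-structured data sits in between: it carries some organizing markers (fields or tags) but cannot be organized in a fixed data schema, like sensor data whose fields vary from record to record.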
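A minimal Python sketch of this distinction, using made-up sample records (the sensor names and fields are illustrative assumptions, not from any real data set):

```python
import csv
import io
import json

# Structured: every row follows the same fixed set of columns.
structured_csv = "id,name,temperature\n1,sensor-a,21.5\n2,sensor-b,19.0\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: each record names its own fields, and the fields can vary.
semi_structured = [
    json.loads('{"id": 1, "temperature": 21.5}'),
    json.loads('{"id": 2, "humidity": 40, "tags": ["outdoor"]}'),
]

# Unstructured: free text with no field markers at all.
unstructured = "Sensor A looked fine this morning, B seemed a bit cold."

# Collect the distinct field layouts seen in each data set.
csv_layouts = {frozenset(row.keys()) for row in rows}
json_layouts = {frozenset(rec.keys()) for rec in semi_structured}
```

Here the CSV rows share one layout, while the JSON records have a different layout each, which is why a single fixed schema does not fit them.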
What are the components of a big data architecture?
Data sources
Data storage
Batch processing
Real-time message ingestion
Stream processing
Analytical data store
Analysis and reporting
Orchestration
What are data sources?
One or more data sources, such as application data stores, static files produced by applications, and real-time data sources like IoT devices.
What is data storage?
A distributed file store that can hold high volumes of files in various formats, often called a data lake.
What is batch processing?
Because the data sets are so large, big data solutions must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
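A toy illustration of the filter-then-aggregate shape of a batch job, assuming a small in-memory list stands in for the data files (the function name `run_batch` and the record fields are hypothetical):

```python
from collections import defaultdict


def run_batch(records):
    """Drop readings with no temperature, then average temperature per device."""
    totals = defaultdict(lambda: [0.0, 0])  # device -> [sum, count]
    for rec in records:
        if rec.get("temperature") is None:
            continue  # filter step: skip invalid readings
        acc = totals[rec["device"]]
        acc[0] += rec["temperature"]
        acc[1] += 1
    # aggregate step: one average per device
    return {device: total / count for device, (total, count) in totals.items()}


batch = [
    {"device": "a", "temperature": 20.0},
    {"device": "a", "temperature": 22.0},
    {"device": "b", "temperature": None},
    {"device": "b", "temperature": 18.0},
]
summary = run_batch(batch)
```

A real batch job would read from distributed storage and run across many machines; only the overall pattern is shown here.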
What is real-time message ingestion?
A way to capture and store real-time messages, often acting as a buffer to support scale-out processing and reliable delivery.
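The buffering role can be sketched with Python's standard `queue` module. This is only an in-memory stand-in: it is not durable and does not give the reliable delivery a real message broker provides. The names `ingest` and `drain` are hypothetical.

```python
import queue

# Bounded in-memory buffer sitting between producers and consumers.
buffer = queue.Queue(maxsize=100)


def ingest(message):
    """Producer side: accept a message, raising queue.Full if the buffer is full."""
    buffer.put_nowait(message)


def drain(max_items):
    """Consumer side: take up to max_items messages in arrival order."""
    items = []
    while len(items) < max_items and not buffer.empty():
        items.append(buffer.get_nowait())
    return items


for i in range(5):
    ingest({"seq": i})

first = drain(3)   # a consumer picks up part of the backlog
rest = drain(10)   # a later read picks up whatever remains
```

Because producers and consumers only touch the buffer, each side can run at its own pace.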
What is stream processing?
After real-time messages are captured, stream processing filters, aggregates, and otherwise prepares the data for analysis.
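One common form of stream aggregation is counting events per fixed time window. A minimal sketch, assuming events arrive as `(timestamp_seconds, payload)` pairs already sorted by timestamp (the function name is hypothetical):

```python
def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, count) per non-overlapping window of sorted events."""
    current_start = None
    count = 0
    for ts, _payload in events:
        start = ts - (ts % window_seconds)  # window this event falls into
        if current_start is None:
            current_start = start
        if start != current_start:
            # The previous window is complete, so emit it and start a new one.
            yield current_start, count
            current_start, count = start, 0
        count += 1
    if current_start is not None:
        yield current_start, count  # emit the final, still-open window


events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
windows = list(tumbling_window_counts(events, 10))
```

The generator processes one event at a time and never holds the whole stream in memory, which is the key difference from a batch job.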
What is the analytical data store?
Processed data must be served in a structured format that can be queried using analytical tools. The data store serves these queries either as a relational data warehouse or as a lakehouse with a medallion architecture.
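As a small-scale stand-in for an analytical store, the sketch below loads processed rows into an in-memory SQLite table and runs an aggregate query. The table name `daily_sales` and its rows are invented for illustration; a real warehouse or lakehouse would be far larger and distributed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "north", 100.0),
        ("2024-01-01", "south", 50.0),
        ("2024-01-02", "north", 75.0),
    ],
)

# An analytical query: total revenue per region.
revenue_by_region = dict(
    conn.execute(
        "SELECT region, SUM(revenue) FROM daily_sales "
        "GROUP BY region ORDER BY region"
    ).fetchall()
)
conn.close()
```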
What is analysis and reporting?
A way to empower users to analyze the data, for example through a data modeling layer or self-service BI.
What is orchestration?
A way to automate the workflows that transform source data, move data between sources and sinks, load the processed data into the analytical data store, or push the results to a dashboard.
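At its core, orchestration means running tasks in dependency order. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the four task names are hypothetical placeholders for real pipeline steps:

```python
from graphlib import TopologicalSorter

log = []  # records the order in which tasks actually ran


def extract():
    log.append("extract")


def transform():
    log.append("transform")


def load():
    log.append("load")


def publish():
    log.append("publish")


tasks = {"extract": extract, "transform": transform, "load": load, "publish": publish}

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
    "publish": {"load"},
}

for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()
```

Production orchestrators add scheduling, retries, and monitoring on top of this ordering idea.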