Big Data Flashcards
(32 cards)
What is continuous data?
Numerical data that can take any value within a range, giving infinitely many possible values, such as age or weight.
What is discrete data?
Numerical data that takes only distinct, countable values, such as shoe size.
What is ordinal data?
Categorical data with a natural order or ranking, such as pain severity or mood.
What is nominal data?
Non-hierarchical categorical data (no inherent order), such as eye colour or dog breed.
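The ordinal/nominal distinction matters when categories are turned into numbers. The sketch below is illustrative only: the category lists and the function names `encode_ordinal` and `encode_nominal` are assumptions, not part of the cards. Ordinal values are mapped to a rank, so comparisons make sense, while nominal values are one-hot encoded, so no category ranks above another.

```python
# Hypothetical category lists chosen for illustration.
PAIN_LEVELS = ["none", "mild", "moderate", "severe"]  # ordered, lowest to highest
EYE_COLOURS = ["blue", "brown", "green"]              # no meaningful order


def encode_ordinal(value, ordered_levels):
    """Map an ordinal category to its rank so that comparisons are meaningful."""
    return ordered_levels.index(value)


def encode_nominal(value, categories):
    """One-hot encode a nominal category; no category outranks another."""
    return [1 if value == category else 0 for category in categories]


print(encode_ordinal("moderate", PAIN_LEVELS))  # 2
print(encode_nominal("brown", EYE_COLOURS))     # [0, 1, 0]
```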
What is Data Aggregation?
Combining data from multiple sources into a single, uniform format, often as summarised values.
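A minimal sketch of aggregation, assuming two made-up sources that record sales in different shapes (the data, field names and function names are hypothetical). Both sources are first converted to one uniform format and then summarised per store.

```python
from collections import defaultdict

# Two hypothetical sources describing the same kind of sale in different shapes.
source_a = [{"store": "north", "amount_gbp": 10.0}, {"store": "south", "amount_gbp": 5.5}]
source_b = [("north", "250"), ("south", "1000")]  # (store, amount in pence as text)


def to_uniform(records_a, records_b):
    """Convert both sources to (store, amount in pounds) tuples."""
    uniform = [(r["store"], float(r["amount_gbp"])) for r in records_a]
    uniform += [(store, int(pence) / 100) for store, pence in records_b]
    return uniform


def total_by_store(rows):
    """Sum the uniform rows into one total per store."""
    totals = defaultdict(float)
    for store, amount in rows:
        totals[store] += amount
    return dict(totals)


totals = total_by_store(to_uniform(source_a, source_b))
print(totals)  # {'north': 12.5, 'south': 15.5}
```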
What is Normalization?
Scaling data into a standard range (for example 0 to 1) so that values measured on different scales can be compared more accurately.
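One common form of normalization is min-max scaling, which maps the smallest value to 0 and the largest to 1. The sketch below assumes that technique; the function name `min_max_normalize` and the sample heights are made up for illustration, and other methods (such as z-score standardisation) exist.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150, 160, 170, 190]  # hypothetical sample data
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 1.0]
```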
What is Apache Kafka and what are its uses?
Apache Kafka is a distributed messaging system for real-time data streams. It is used for real-time data pipelines, log aggregation and stream processing.
What is Apache Flume and what are its uses?
Apache Flume collects, aggregates and moves large amounts of log data from multiple sources to a centralised data store. It is primarily used for log data collection.
What is AWS Kinesis and what are its uses?
AWS Kinesis is a real-time data streaming service for ingesting and processing data streams. It is used for real-time event tracking and IoT data ingestion.
What are the pros and cons of Kafka?
Pros: Scalable, low latency, fault tolerant
Cons: requires setup, complex to manage.
What are the pros and cons of Flume?
Pros: simple, strong Hadoop integration
Cons: geared mainly toward log data, not designed for real-time stream processing
What are the pros and cons of AWS Kinesis?
Pros: fully managed, AWS integration
Cons: AWS dependent, costs may rise
What is a data warehouse?
A centralised repository for storing structured data, optimised for querying and reporting on large-scale datasets.
What are the uses of a data warehouse?
Business intelligence, reporting, analytics applications and historical analysis
What are the pros and cons of a data warehouse?
Pros: fast queries on structured data
Cons: expensive, less flexible for unstructured data
What is a data lake?
A centralised repository for storing unstructured, semi-structured and structured data in its raw format, so that it can be processed later.
What are the uses of a data lake?
Machine Learning, Big Data analytics
What are the pros and cons of a data lake?
Pros: cost-effective, stores all types of data
Cons: lack of governance, complex for analysis
What is a data lakehouse?
A modern data architecture that combines features of a data lake and a data warehouse, allowing efficient storage and processing of both structured and unstructured data.
What are the uses of a data lakehouse?
Predictive analytics, AI, large-scale data processing.
What are the pros and cons of a data lakehouse?
Pros: combines flexibility and governance
Cons: still evolving, potential integration challenges
What is Hadoop Distributed File System (HDFS)?
The primary storage system used by Hadoop applications. It allows distributed storage and processing of large datasets across clusters of computers.
What is the architecture of HDFS?
Files are split into large blocks (default 128 MB in Hadoop 2.x and later, 64 MB in Hadoop 1.x) and distributed across multiple machines. Each block is replicated across multiple nodes (three copies by default) for fault tolerance.
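The block-and-replica layout can be made concrete with some arithmetic. The sketch below is only an estimate under stated assumptions: a 128 MB block size and a replication factor of 3 (the usual HDFS defaults, both configurable per cluster), and a hypothetical helper name `hdfs_storage_estimate`. It is not an HDFS API.

```python
import math


def hdfs_storage_estimate(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count, total block replicas and raw storage for one file.

    Assumes the last block only occupies the space it actually needs,
    so raw storage is the file size multiplied by the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


# A hypothetical 1000 MB file: 8 blocks, 24 block replicas, 3000 MB of raw storage.
print(hdfs_storage_estimate(1000))  # (8, 24, 3000)
```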
What is HDFS used for?
Scalable storage of unstructured data, such as logs, media or large scientific research datasets.