Big Data Flashcards by Harry Savill

What is continuous data?

Numerical data that can take any value within a range (infinitely many possible values), such as age or weight.

What is discrete data?

Numerical data that can only take distinct, countable values, such as shoe size.

What is ordinal data?

Categorical data whose categories have a natural order (ranking), such as pain severity or mood.

What is nominal data?

Categorical data with no inherent order, such as eye colour or dog breed.
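
Worked example (a minimal sketch, not part of the original deck): ordinal categories can be sorted by rank, while nominal categories can only be counted. The `PAIN_ORDER` list and the sample values are made up for illustration.

```python
from collections import Counter

# Ordinal: categories with a meaningful order, encoded here by list position.
PAIN_ORDER = ["none", "mild", "moderate", "severe"]  # assumed ranking
pain_reports = ["mild", "severe", "none", "moderate", "mild"]
sorted_pain = sorted(pain_reports, key=PAIN_ORDER.index)

# Nominal: no order, so counting is meaningful but ranking is not.
eye_colours = ["brown", "blue", "brown", "green"]
colour_counts = Counter(eye_colours)
most_common_colour = colour_counts.most_common(1)[0][0]

print(sorted_pain)         # ['none', 'mild', 'mild', 'moderate', 'severe']
print(most_common_colour)  # brown
```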

What is Data Aggregation?

Combining data from multiple sources into a single, uniform format, often in summarised form.
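
Worked example (a minimal sketch under assumed data): two hypothetical sources record the same kind of purchase with different field names and units. Each record is mapped onto one shared shape before the totals are combined. The names `shop_a`, `shop_b`, `to_uniform` and `aggregate` are illustrative only.

```python
from collections import defaultdict

# Hypothetical records from two sources with different field names and units.
shop_a = [{"customer": "alice", "amount_gbp": 12.50},
          {"customer": "bob", "amount_gbp": 8.00}]
shop_b = [{"user": "alice", "pence": 750},
          {"user": "carol", "pence": 300}]

def to_uniform(record):
    """Map a record from either source to (customer, amount in pence)."""
    if "amount_gbp" in record:
        return record["customer"], round(record["amount_gbp"] * 100)
    return record["user"], record["pence"]

def aggregate(*sources):
    """Combine all sources into one total per customer, in pence."""
    totals = defaultdict(int)
    for source in sources:
        for record in source:
            customer, pence = to_uniform(record)
            totals[customer] += pence
    return dict(totals)

print(aggregate(shop_a, shop_b))  # {'alice': 2000, 'bob': 800, 'carol': 300}
```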

What is Normalization?

Rescaling data to a common range (for example 0 to 1) so that values measured on different scales can be compared more accurately.
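
Worked example (a minimal sketch): min-max scaling is one common way to normalise, assumed here for illustration. It maps the smallest value to 0 and the largest to 1. The function name and the sample heights are made up.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range 0..1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [150, 160, 170, 180, 190]
print(min_max_normalize(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```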

What is Apache Kafka and what are its uses?

Apache Kafka is a distributed messaging system for real-time data streams. It is used for real-time data pipelines, log aggregation and stream processing.

What is Apache Flume and what are its uses?

Apache Flume collects, aggregates and moves large amounts of log data from multiple sources to a centralised data store. It is primarily used for log data collection.

What is AWS Kinesis and what are its uses?

AWS Kinesis is a real-time data streaming service for ingesting and processing data streams. It is used for real-time event tracking and IoT data ingestion.

What are the pros and cons of Kafka?

Pros: scalable, low latency, fault tolerant
Cons: requires setup, complex to manage

What are the pros and cons of Flume?

Pros: simple, strong Hadoop integration
Cons: Limited to log data, not real-time

What are the pros and cons of AWS Kinesis?

Pros: fully managed, AWS integration
Cons: AWS dependent, costs may rise

What is a data warehouse?

A centralised repository for storing structured data, optimised for querying and reporting on large-scale datasets.

What are the uses of a data warehouse?

Business intelligence, reporting, analytics applications and historical analysis

What are the pros and cons of a data warehouse?

Pros: fast queries on structured data
Cons: expensive, less flexible for unstructured data

What is a data lake?

A centralised repository for storing unstructured, semi-structured and structured data in its raw format, so that it can be processed later.

What are the uses of a data lake?

Machine Learning, Big Data analytics

What are the pros and cons of a data lake?

Pros: cost-effective, stores all types of data
Cons: lack of governance, complex for analysis

What is a data lakehouse?

A modern data architecture that combines features of a data lake and a data warehouse, allowing efficient storage and processing of both structured and unstructured data.

What are the uses of a data lakehouse?

Predictive analytics, AI, large scale data processing.

What are the pros and cons of a data lakehouse?

Pros: combines flexibility and governance
Cons: still evolving, potential integration challenges

What is Hadoop Distributed File System (HDFS)?

The primary storage system used by Hadoop applications. It stores large datasets across clusters of computers so that they can be processed in a distributed way.

What is the architecture of HDFS?

Files are split into large blocks (128 MB by default in Hadoop 2.x and later, 64 MB in Hadoop 1.x) and distributed across multiple machines. Each block is replicated across multiple nodes (three copies by default) for fault tolerance.
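
Worked example (a minimal sketch): the arithmetic behind block splitting and replication. It assumes a 128 MB block size and a replication factor of 3, which is HDFS's usual default. The function name `hdfs_footprint` is made up for illustration and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128     # assumed default block size (Hadoop 2.x and later)
REPLICATION_FACTOR = 3  # assumed default replication factor

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # The final, partly filled block only occupies its actual size on disk,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb

print(hdfs_footprint(1000))  # (8, 24, 3000)
```

A 1000 MB file therefore needs 8 blocks (seven full blocks plus one partial block), stored as 24 block replicas across the cluster.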

What is HDFS used for?

Scalable storage of unstructured data such as logs, media or large datasets in scientific research

How does HDFS compare to traditional storage?

HDFS is optimised for write-once, read-many access to very large files. Compared with traditional storage, it trades low-latency random access for large scale and high fault tolerance.

What is MongoDB?

A NoSQL document-oriented database that stores data in flexible JSON-like documents. It's suited for applications requiring hierarchical or complex structures like a CMS.
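
Worked example (a minimal sketch): what a flexible, JSON-like document might look like for a CMS. This uses plain Python dictionaries and the standard `json` module only, with no MongoDB driver involved. All field names and values are hypothetical.

```python
import json

# Hypothetical CMS article stored as one nested, JSON-like document.
article = {
    "_id": "post-001",
    "title": "Intro to Big Data",
    "tags": ["big-data", "storage"],
    "author": {"name": "A. Writer", "role": "editor"},
    "comments": [
        {"user": "sam", "text": "Helpful summary."},
        {"user": "lee", "text": "More on HDFS please."},
    ],
}

# Schema flexibility: a second document in the same collection can carry
# different fields from the first.
video = {"_id": "post-002", "title": "Kafka demo", "duration_seconds": 312}

collection = [article, video]
print(json.dumps(article, indent=2))

titles = [doc["title"] for doc in collection]
print(titles)  # ['Intro to Big Data', 'Kafka demo']
```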

What are the pros and cons of MongoDB?

Pros: schema flexibility, easy to scale
Cons: not optimized for heavy write workloads

What is Cassandra?

A distributed wide-column store designed for high availability and scalability across multiple nodes, often used in IoT apps.

What are the Pros and Cons of Cassandra?

Pros: highly available, optimized for writes
Cons: limited complex querying, eventual consistency

What is Hadoop MapReduce?

A programming model for processing large datasets in parallel across a distributed cluster.
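
Worked example (a minimal sketch): the map, shuffle and reduce steps of a word count, run in a single Python process. Real Hadoop MapReduce distributes these phases across a cluster; the function names and sample log lines here are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts["error"], counts["disk"], counts["full"])  # 2 2 1
```

Each `map_phase` call depends only on its own line, and each `reduce_phase` call only on its own key, which is what lets a cluster run them in parallel.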

Where is Hadoop MapReduce used?

Processing log files, web indexing, and aggregating large-scale data in parallel.

Big Data Flashcards

(32 cards)