Big Data Flashcards

(32 cards)

1
Q

What is continuous data?

A

Numerical data that can take any value within a range, so there are infinitely many possible values, such as age or weight.

2
Q

What is discrete data?

A

Numerical data that takes distinct, countable values, such as shoe size.

3
Q

What is ordinal data?

A

Categorical data with a natural order (hierarchy), such as pain severity or a mood rating.

4
Q

What is nominal data?

A

Non-hierarchical categorical data, such as eye colour or dog breed.
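The difference between ordinal and nominal data can be sketched in a few lines. The pain scale, its labels and the eye colour sample below are hypothetical values chosen for illustration: ordinal categories support "greater than" comparisons, while nominal categories only support equality and counting.

```python
from collections import Counter

# Hypothetical ordinal scale, listed from lowest to highest.
PAIN_LEVELS = ["none", "mild", "moderate", "severe"]


def pain_rank(level):
    """Return the position of a pain level on the ordered scale."""
    return PAIN_LEVELS.index(level)


def is_worse(level_a, level_b):
    """Ordinal data can be ranked, so this comparison is meaningful."""
    return pain_rank(level_a) > pain_rank(level_b)


# Nominal data has no order: we can count categories but not rank them.
eye_colours = ["blue", "brown", "green", "brown"]
most_common_colour = Counter(eye_colours).most_common(1)[0][0]

print(is_worse("severe", "mild"))
print(most_common_colour)
```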

5
Q

What is Data Aggregation?

A

Collecting data from multiple sources and combining it into a uniform, often summarised, format.
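A minimal sketch of aggregation, assuming two hypothetical sources that report the same kind of temperature reading under different field names and units. The records, field names and the choice of a per-sensor mean are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records from two sources with different field names and units.
source_a = [{"sensor": "s1", "temp_c": 20.0}, {"sensor": "s2", "temp_c": 22.0}]
source_b = [{"id": "s1", "temp_f": 71.6}, {"id": "s2", "temp_f": 75.2}]


def to_uniform(record):
    """Map a record from either source to {'sensor': str, 'temp_c': float}."""
    if "temp_c" in record:
        return {"sensor": record["sensor"], "temp_c": record["temp_c"]}
    # Convert Fahrenheit to Celsius for the second source.
    return {"sensor": record["id"], "temp_c": (record["temp_f"] - 32) * 5 / 9}


def aggregate(*sources):
    """Combine all sources and return the mean temperature per sensor."""
    grouped = defaultdict(list)
    for source in sources:
        for record in source:
            row = to_uniform(record)
            grouped[row["sensor"]].append(row["temp_c"])
    return {sensor: round(mean(values), 2) for sensor, values in grouped.items()}


summary = aggregate(source_a, source_b)
print(summary)
```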

6
Q

What is Normalization?

A

Scaling data into a standard range (for example 0 to 1) so that values measured on different scales can be compared more accurately.
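One common way to do this is min-max scaling, which maps the smallest value to 0 and the largest to 1. The sketch below is one possible normalization technique, not the only one, and the sample heights and incomes are made-up values.

```python
def min_max_normalize(values):
    """Scale a list of numbers into the range 0 to 1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Two made-up features measured on very different scales.
heights_cm = [150, 160, 170, 180, 190]
incomes = [20000, 30000, 40000, 50000, 60000]

scaled_heights = min_max_normalize(heights_cm)
scaled_incomes = min_max_normalize(incomes)
print(scaled_heights)
print(scaled_incomes)
```

After scaling, both features lie on the same 0 to 1 range, so neither dominates a comparison simply because its raw numbers are larger.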

7
Q

What is Apache Kafka and what are its uses?

A

Apache Kafka is a distributed messaging system for real-time data streams. It is used for real-time data pipelines, log aggregation and stream processing.
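The core idea can be illustrated with a toy in-memory model: messages are appended to a log, and each consumer keeps its own read position (offset). This is not the Kafka client API; the `TopicLog` and `Consumer` classes are hypothetical stand-ins written for this card, and they ignore partitions, brokers and persistence.

```python
class TopicLog:
    """Toy stand-in for a single topic partition (not the Kafka API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset):
        """Return every message at or after the given offset."""
        return self._messages[offset:]


class Consumer:
    """Toy consumer that remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Fetch unread messages and advance this consumer's offset."""
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch


log = TopicLog()
offsets = [log.append(event) for event in ["click", "view", "purchase"]]

fast, slow = Consumer(log), Consumer(log)
first = fast.poll()        # reads the three existing messages
log.append("refund")
second = fast.poll()       # reads only the new message
slow_batch = slow.poll()   # an independent consumer still sees everything
```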

8
Q

What is Apache Flume and what are its uses?

A

Apache Flume collects, aggregates and moves large amounts of log data from multiple sources to a centralised data store. It is primarily used for log data collection.

9
Q

What is AWS Kinesis and what are its uses?

A

AWS Kinesis is a real-time data streaming service for ingesting and processing data streams. It is used for real-time event tracking and IoT data ingestion.

10
Q

What are the pros and cons of Kafka?

A

Pros: Scalable, low latency, fault tolerant
Cons: requires setup, complex to manage.

11
Q

What are the pros and cons of Flume?

A

Pros: simple, strong Hadoop integration
Cons: Limited to log data, not real-time

12
Q

What are the pros and cons of AWS Kinesis?

A

Pros: fully managed, AWS integration
Cons: AWS dependent, costs may rise

13
Q

What is a data warehouse?

A

A centralised repository for storing structured data, optimised for querying and reporting on large-scale datasets.

14
Q

What are the uses of a data warehouse?

A

Business intelligence, reporting, analytics applications and historical analysis

15
Q

What are the pros and cons of a data warehouse?

A

Pros: fast queries on structured data
Cons: expensive, less flexible for unstructured data

16
Q

What is a data lake?

A

A centralised repository that stores unstructured, semi-structured and structured data in its raw format so that it can be processed later.

17
Q

What are the uses of a data lake?

A

Machine Learning, Big Data analytics

18
Q

What are the pros and cons of a data lake?

A

Pros: Cost Effective, stores all types of data
Cons: lack of governance, complex for analysis

19
Q

What is a data lakehouse?

A

A modern data architecture that combines features of a data lake and a data warehouse, allowing efficient storage and processing of both structured and unstructured data.

20
Q

What are the uses of a data lakehouse?

A

Predictive analytics, AI, large-scale data processing.

21
Q

What are the pros and cons of a data lakehouse?

A

Pros: combines flexibility and governance
Cons: still evolving, potential integration challenges

22
Q

What is Hadoop Distributed File System (HDFS)?

A

The primary storage system used by Hadoop applications. It stores large datasets in a distributed way across clusters of computers, which in turn supports distributed processing of that data.

23
Q

What is the architecture of HDFS?

A

Files are split into large blocks (128 MB by default in Hadoop 2.x and later, 64 MB in Hadoop 1.x) and distributed across multiple machines. Each block is replicated across multiple nodes (three copies by default) for fault tolerance.
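The arithmetic behind block splitting and replication can be worked through with a short sketch. It assumes the 128 MB default block size and the default replication factor of 3; the function name and the 1000 MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128      # assumed default block size (Hadoop 2.x and later)
REPLICATION_FACTOR = 3   # assumed default HDFS replication factor


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Estimate how a file is split into blocks and replicated."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies its actual size, so raw storage
        # is the file size multiplied by the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


footprint = hdfs_footprint(1000)
print(footprint)
```

A 1000 MB file therefore needs 8 blocks (seven full blocks plus one partial block), stored as 24 block replicas across the cluster.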

24
Q

What is HDFS used for?

A

Scalable storage of unstructured data such as logs, media or large scientific research datasets.

25
Q

How does HDFS compare to traditional storage?

A

HDFS is optimised for write-once, read-many access and offers large-scale capacity and high fault tolerance.

26
Q

What is MongoDB?

A

A NoSQL document-oriented database that stores data in flexible JSON-like documents. It is suited to applications that need hierarchical or complex structures, such as a CMS.
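To make "JSON-like document" concrete, the sketch below builds a nested record of the kind a CMS might store. It uses only Python's standard `json` module rather than a MongoDB driver, and every field name and value is a hypothetical example.

```python
import json

# Hypothetical CMS article stored as one nested, JSON-like document.
article = {
    "_id": "article-001",  # illustrative string id
    "title": "Intro to data lakes",
    "author": {"name": "A. Writer", "role": "editor"},
    "tags": ["big-data", "storage"],
    "comments": [
        {"user": "reader1", "text": "Helpful summary."},
    ],
}

# Nested fields and lists live inside the same document, so related
# data does not need to be split across several tables.
as_json = json.dumps(article, indent=2)
restored = json.loads(as_json)
print(restored["author"]["name"])
```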

27
Q

What are the pros and cons of MongoDB?

A

Pros: schema flexibility, easy to scale
Cons: not optimized for heavy write workloads

28
Q

What is Cassandra?

A

A distributed wide-column store designed for high availability and scalability across multiple nodes, often used in IoT apps.

29
Q

What are the pros and cons of Cassandra?

A

Pros: highly available, optimized for writes
Cons: limited complex querying, eventual consistency

30
Q

What is Hadoop MapReduce?

A

A programming model for processing large datasets in parallel across a distributed cluster.
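The model can be sketched with the classic word-count example. This version runs the map, shuffle and reduce steps in a single Python process purely for illustration; it is not Hadoop's API, and in a real cluster the map and reduce tasks would run in parallel on different nodes. The function names and input lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data lakes store raw data"])
print(counts)
```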

31
Q

Where is Hadoop MapReduce used?

A

Processing log files, web indexing, and aggregating large-scale data in parallel.

32