Big Data Flashcards

(7 cards)

1
Q

Data generation cycle -

A

More data leads to better algorithms, which create a better user experience, which attracts more users, who in turn generate more data…

2
Q

What is big data, why is it important, and how is it handled? -

A

What is Big Data? Extremely large datasets (from terabytes up to exabytes or more) that are too big, too fast, or too complex for traditional tools to handle. They accumulate over time, which makes analysis difficult.

Why is Big Data Important? It reveals valuable insights, patterns, and trends. It improves decision-making and business strategies, provides a competitive advantage, optimises processes, and helps in understanding complex systems. AI makes big data more accessible.

How to Handle Big Data? Use specialised tools and technologies. Apply advanced analytics techniques.

3
Q

What three Vs characterise big data? -

A

Volume: The massive size of the data (terabytes, petabytes, and beyond). This is the most representative feature.

Velocity: The speed at which data is generated and processed. Processing types:
  • Batch: Processes large chunks of data at intervals.
  • Stream: Processes data in real time as it arrives.
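To make the batch/stream distinction concrete, here is a minimal sketch. The function names and the `readings` list are hypothetical, chosen only for illustration; real systems would use dedicated batch or streaming engines rather than plain Python.

```python
from typing import Iterable, Iterator, List


def batch_totals(events: List[int], batch_size: int) -> List[int]:
    """Batch: wait until a chunk has accumulated, then process it in one go."""
    return [
        sum(events[i:i + batch_size])
        for i in range(0, len(events), batch_size)
    ]


def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Stream: update the result each time a single event arrives."""
    running = 0
    for value in events:
        running += value
        yield running


readings = [3, 1, 4, 1, 5, 9]  # hypothetical sensor readings

batch_result = batch_totals(readings, batch_size=3)   # one result per chunk
stream_result = list(stream_totals(readings))         # one result per event
```

The batch version produces a result only once per chunk, while the stream version has an up-to-date result after every event, at the cost of doing work continuously.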

Variety: Different data types from various sources. Three categories:
  • Structured: Organised and follows a fixed schema, which makes patterns easy to identify (e.g., financial records, sales data).
  • Semi-structured: Partly organised but flexible; contains tags or metadata (e.g., XML, JSON, log files).
  • Unstructured: No predefined structure; hardest to analyse (e.g., free text, images, social media content).
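The three categories can be illustrated with a small sketch using only the Python standard library. All sample data below is made up for illustration.

```python
import csv
import io
import json

# Structured: every row follows the same schema (hypothetical sales data).
structured = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
rows = list(csv.DictReader(structured))
total = sum(float(row["amount"]) for row in rows)

# Semi-structured: keys act as tags, but fields may differ between records.
semi = json.loads('[{"user": "a", "tags": ["x"]}, {"user": "b"}]')
tag_counts = [len(record.get("tags", [])) for record in semi]

# Unstructured: free text has no fields, so any structure must be inferred.
text = "Great product, fast delivery. Great support."
great_mentions = text.lower().count("great")
```

Note how the amount of defensive handling grows from top to bottom: the CSV can be read by column name, the JSON needs a default for the missing `tags` key, and the free text supports only crude pattern matching.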

4
Q

Other characteristics of big data -

A

Veracity (how accurate and trustworthy the data is) and value (how useful the insights drawn from it are).

5
Q

What are two data-processing approaches for analytics? -

A
  • ETL (extract, transform, load):
    • Extract → Collect raw data from sources.
    • Transform → Clean and standardize data.
    • Load → Store processed data in a data warehouse.
      Pros: Popular for structured data warehouses. Cons: Requires predefined data structure before loading.
  • ELT (extract, load, transform):
    • Extract → Collect raw data from sources.
    • Load → Store raw data in a data lake/warehouse.
    • Transform → Process data as needed later.
      Pros: Handles large-scale data and can store virtually unlimited raw data. Cons: More complex, requires powerful storage and computing.

ETL: Best for structured data. Requires setup time and extra servers for transformations. Suited to predefined analytics.

ELT: Handles all data types. Faster, because loading and transformation can run in parallel. Ideal for IoT sensor data streams.
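The difference in step order can be sketched as follows. This is a toy illustration, not a real pipeline: `extract`, `transform`, the sample records, and the plain lists standing in for the warehouse and data lake are all hypothetical.

```python
from typing import Dict, List, Union

RawRecord = Dict[str, str]
CleanRecord = Dict[str, Union[str, float]]


def extract() -> List[RawRecord]:
    """Collect raw data from a (hypothetical) source."""
    return [
        {"sensor": " a ", "temp_c": "21.5"},
        {"sensor": "b", "temp_c": "n/a"},   # unusable reading
        {"sensor": "A", "temp_c": "22.5"},
    ]


def transform(records: List[RawRecord]) -> List[CleanRecord]:
    """Clean and standardise: normalise names, drop unparseable readings."""
    clean: List[CleanRecord] = []
    for record in records:
        try:
            temp = float(record["temp_c"])
        except ValueError:
            continue
        clean.append({"sensor": record["sensor"].strip().upper(), "temp_c": temp})
    return clean


# ETL: transform first, so only cleaned data is ever loaded.
warehouse = transform(extract())

# ELT: load the raw data first, then transform later, when it is needed.
data_lake = extract()
report = transform(data_lake)
```

Both routes end with the same cleaned records, but only the ELT route still holds the original raw data (including the rejected reading), so a different transformation can be applied later.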

6
Q

Big data analytics cycle -

A

Based on ELT: (1) extract data from multiple sources, (2) store it in a landing zone (a distributed system), (3) transform it, and (4) integrate it into particular analytics tasks. The point of ELT is that data is stored first and transformed later, which allows flexible, large-scale analysis.
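The four stages of the cycle can be sketched in a few lines. This is a simplified, single-machine illustration: the source names, the records, and the list standing in for a distributed landing zone are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# 1. Extract: data arrives from multiple (hypothetical) sources.
sources = {
    "web_shop": [{"customer": "c1", "spend": "30"}, {"customer": "c2", "spend": "70"}],
    "mobile_app": [{"customer": "c1", "spend": "20"}],
}

# 2. Store: raw records land unchanged, tagged only with their origin.
landing_zone = [
    {"source": name, **record}
    for name, records in sources.items()
    for record in records
]

# 3. Transform: convert types once an analysis actually needs them.
typed = [{**record, "spend": float(record["spend"])} for record in landing_zone]

# 4. Integrate: the same stored data feeds different analytics tasks.
spend_per_customer = defaultdict(float)
for record in typed:
    spend_per_customer[record["customer"]] += record["spend"]

mean_spend_per_source = {
    name: mean(r["spend"] for r in typed if r["source"] == name)
    for name in sources
}
```

Because the landing zone keeps the raw records, a new analytics task can be added later without re-extracting from the sources.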

7
Q

Advantages and disadvantages of big data analysis -

A

Pros: Promotes data-driven decision-making. Draws on data from different sources, including the internet, online shopping sites, social media, databases, and external third-party sources. Identifies issues in systems and business processes in real time.

Cons: Managing large volumes of data. Finding and fixing data quality issues. Dealing with the complexity of data integration and preparation.
