Lecture 8 Flashcards

1
Q

Big Data

A

Data that cannot be stored and processed on a single device.

2
Q

2 aspects of big data

A
  • Distributed storage (Distributed File Systems / Sharding)
  • Distributed processing (and handling derived data)
3
Q

Processing on Big-Data

A

Not as easy as writing an SQL Query and expecting fast results
- Exploration
- Analytics
- Processing
- Publishing

4
Q

Database architecture

A

Used by management to monitor business performance.

  • Dashboards are built once in software based on information needs.
  1. View
    (REST API / SDK)
  2. Controller
    (Database API / SQL)
  3. Model / DB
    (Database API / SQL)
  4. Power BI, etc
5
Q

Data warehousing

A

Collecting data for reporting purposes.

  • Make static snapshots to send to a central data warehouse.
  • Extract, transform, load (ETL)
  • Staging - preparing data for reporting and integration.
  • Takes load off operational systems.
  • Enriches information by combining systems.
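The extract, transform, load steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not a production pipeline: the two in-memory databases, the table names (`orders`, `daily_revenue`) and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical operational system holding raw order rows.
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE orders (id INTEGER, day TEXT, amount_cents INTEGER)")
operational.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-01", 1250), (2, "2024-01-01", 750), (3, "2024-01-02", 3000)],
)

# Hypothetical central warehouse that only holds reporting tables.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL)")


def extract(conn):
    """Extract: take a static snapshot of the operational rows."""
    return conn.execute("SELECT day, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transform (staging): aggregate per day and convert cents to whole units."""
    totals = {}
    for day, cents in rows:
        totals[day] = totals.get(day, 0) + cents
    return [(day, cents / 100) for day, cents in sorted(totals.items())]


def load(conn, rows):
    """Load: write the prepared rows into the warehouse table."""
    conn.executemany("INSERT OR REPLACE INTO daily_revenue VALUES (?, ?)", rows)
    conn.commit()


load(warehouse, transform(extract(operational)))
report = warehouse.execute(
    "SELECT day, revenue FROM daily_revenue ORDER BY day"
).fetchall()
print(report)
```

Reports now query `warehouse` only, so the operational database is read once per snapshot rather than once per report.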
6
Q

Database Architecture

A
  1. Buy bigger machines
    • Effectiveness of upgrading hardware is limited and expensive
    • Single point of failure
  2. Buy more machines
    • Create replicas for instances
7
Q

Data Processing

A

ETL to Big Data

  1. Relational databases
  2. ETL
  3. Big Data
  4. Cloud Solutions
8
Q

HDFS

A

Hadoop Distributed File System

  • Storage layer for the Hadoop big data system
  • Based on the Google File System (GFS)
  • Fault-tolerant distributed file system
  • Designed to turn a computing cluster (a large collection of loosely connected compute nodes) into a massively scalable pool of storage.
    - Provides redundant storage for massive amounts of data
9
Q

Properties of HDFS

A
  • Built to be resilient to failure: when a data node writes a data block to disk, that block is also replicated to other data nodes.
  • HDFS can be made rack aware, since redundancy does not help when both copies of a block are written to disk drives in the same rack and that rack fails.
  • The name node tells the data nodes where to write data.
  • The name node also tells your application which data nodes hold the file.
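The division of labour above (the name node keeps metadata, data nodes keep blocks, replicas go to different racks) can be illustrated with a toy model. This is not the HDFS API and not its real placement policy, which is more involved and defaults to three replicas; `ToyNameNode`, the rack names and the node names are hypothetical, and the sketch simply puts one replica in each of two racks.

```python
import itertools


class ToyNameNode:
    """Toy stand-in for a name node: it stores metadata only, never file data."""

    def __init__(self, racks, replication=2):
        # racks maps a rack name to the data nodes in that rack (hypothetical names).
        if replication > len(racks):
            raise ValueError("need at least one rack per replica")
        self.racks = racks
        self.replication = replication
        self.block_map = {}  # (filename, block index) -> data nodes holding the block
        self._rack_cycle = itertools.cycle(sorted(racks))

    def allocate_block(self, filename, index):
        """Tell the writer which data nodes should store this block."""
        chosen = []
        # Consecutive racks from the cycle are distinct, so the replicas of one
        # block never share a rack and a single rack failure cannot lose it.
        for _ in range(self.replication):
            nodes = self.racks[next(self._rack_cycle)]
            chosen.append(nodes[index % len(nodes)])
        self.block_map[(filename, index)] = chosen
        return chosen

    def locate(self, filename):
        """Tell a reading application which data nodes hold each block of the file."""
        return {
            index: nodes
            for (name, index), nodes in sorted(self.block_map.items())
            if name == filename
        }


racks = {"rack-a": ["dn1", "dn2"], "rack-b": ["dn3", "dn4"]}
name_node = ToyNameNode(racks, replication=2)
for block_index in range(3):
    name_node.allocate_block("logs.txt", block_index)

locations = name_node.locate("logs.txt")
print(locations)
```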
10
Q

HBase Column Family

A

Column families group related columns that are stored together, which allows for efficient sharding and compression.

11
Q

Streaming Data

A

Imagine having to process incoming messages from:

An MMORPG where players are moving around, finding gold and loot.
Uber drivers moving around all over a country.

We need real-time processing of information.

12
Q

Apache Kafka

A

Functions like a distributed publish-subscribe messaging system.

13
Q

Apache Kafka features

A
  • Durability
  • Scalability
  • High availability
  • High throughput (scalable messaging system)
  • Distributed, reliable publish-subscribe system
  • Designed as a message queue and implemented as a distributed log service.
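The "message queue implemented as a log" idea can be shown with a toy in-memory topic. This is not Kafka's client API and it leaves out partitions, replication and persistence; `ToyTopic` and the consumer names are hypothetical. The point is that reading never removes a message: each subscriber just tracks its own offset into an append-only log.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic: an append-only log."""

    def __init__(self):
        self._log = []      # messages are only ever appended, never removed on read
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer, max_messages=10):
        """Return unread messages for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


topic = ToyTopic()
for event in ["player_moved", "gold_found", "loot_found"]:
    topic.publish(event)

# Two independent subscribers read the same log at their own pace.
analytics_first = topic.poll("analytics", max_messages=2)
analytics_second = topic.poll("analytics")
billing_all = topic.poll("billing")
print(analytics_first, analytics_second, billing_all)
```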
14
Q

Batch processing

A

Processing of blocks of data that have already been stored over a period of time.
- Often on disk.
- Hadoop and MapReduce

e.g. processing the transactions performed by a financial firm over a week.
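A batch job in the MapReduce style can be sketched in plain Python. This runs in one process and only mimics the map, shuffle and reduce phases that Hadoop would spread over a cluster; the transaction records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical week of already-stored transactions: (account, amount).
transactions = [
    ("acc-1", 120.0), ("acc-2", 40.0), ("acc-1", 30.0),
    ("acc-3", 75.5), ("acc-2", 10.0),
]


def map_phase(record):
    """Map: emit one (key, value) pair per input record."""
    account, amount = record
    yield account, amount


def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


mapped = [pair for record in transactions for pair in map_phase(record)]
weekly_totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(weekly_totals)
```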

15
Q

Stream

A

Process data in real time as it arrives, and detect conditions within a short period of time after receiving the data.

  • Often in memory.
  • Multiple publishers.
  • Concurrency
  • Kafka and Spark Streaming

e.g. fraud detection, social media sentiment analysis, log monitoring, analysing customer behaviour.
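By contrast with batch processing, a stream processor handles each event as it arrives and keeps only a small in-memory window. Below is a minimal sketch of a fraud-style check, assuming time-ordered events; the window length, the threshold and the `detect_bursts` helper are hypothetical, and a real deployment would read from something like Kafka rather than a list.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # hypothetical window length
MAX_PER_WINDOW = 3   # hypothetical threshold


def detect_bursts(events):
    """Yield (timestamp, card) when a card exceeds MAX_PER_WINDOW payments in the window.

    events is any iterable of (timestamp_seconds, card_id), assumed time-ordered.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card in events:
        window = recent[card]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while timestamp - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_PER_WINDOW:
            yield timestamp, card


stream = [(0, "A"), (5, "B"), (10, "A"), (20, "A"), (30, "A"), (200, "A"), (205, "B")]
alerts = list(detect_bursts(stream))
print(alerts)
```

Because `detect_bursts` is a generator, an alert is produced as soon as the offending event is seen, without waiting for the rest of the stream.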

16
Q

Data Exploration

A
  • Data exploration is about describing the data by means of statistical and visualisation techniques.
  • We explore in order to understand the features and bring important features to our models.
17
Q

Data Exploration with big data

A
  • We cannot load all data in memory
  • Some operations take too much time to run on a single machine.
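One common workaround for the memory limit is to read the data in chunks and keep only running aggregates. The sketch below uses the standard library only; the `read_chunks` helper and the tiny in-memory "file" are hypothetical stand-ins for a dataset that is too large to load at once.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory at a time."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Small in-memory stand-in for a file too large to load at once (hypothetical data).
big_file = io.StringIO("user,amount\na,10\nb,20\na,30\nc,40\nb,50\n")

count, total, maximum = 0, 0.0, float("-inf")
for chunk in read_chunks(big_file, chunk_size=2):
    amounts = [float(row["amount"]) for row in chunk]
    count += len(amounts)
    total += sum(amounts)
    maximum = max(maximum, max(amounts))

mean = total / count
print(count, mean, maximum)
```

This works for statistics that can be updated incrementally (count, sum, min, max, mean); exact medians or quantiles need a different approach.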
18
Q

Exploring using Pandas

A
  • Pandas is an implementation of the DataFrame data structure.
  • PySpark offers a similar DataFrame abstraction and distributes the computation across a cluster.
  • Dask provides another distributed DataFrame alternative.
19
Q

When to use DataFrame?

A

The Pandas DataFrame is a structure that contains two-dimensional data and its corresponding labels. DataFrames are widely used in data science, machine learning, scientific computing, and many other data-intensive fields. DataFrames are similar to SQL tables or the spreadsheets that you work with in Excel or Calc.

20
Q

Why use DataFrame?

A

A DataFrame can store heterogeneous data (each column may have its own type), while a Series stores homogeneous data (a single type).
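The difference shows up in the dtypes. A small sketch, assuming pandas is installed; the column names and values are made up for illustration.

```python
import pandas as pd

# A Series has exactly one dtype for all of its values.
series = pd.Series([1, 2, 3])

# A DataFrame has one dtype per column, so columns can differ in type.
frame = pd.DataFrame({
    "name": ["Ann", "Bob"],   # text column
    "age": [31, 45],          # integer column
    "score": [9.5, 7.0],      # float column
})

print(series.dtype)
print(frame.dtypes)
```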

21
Q

Distributed DataFrame

A
  1. Original DataFrame
  2. Split (the DataFrame into chunks small enough to fit in memory)
  3. Apply (the transformation to each chunk independently)
  4. Combine (each chunk back into a data frame)
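The four steps above can be sketched with plain pandas. This runs sequentially in one process; an engine such as PySpark or Dask would hand the chunks to different workers instead. The sample data and the helper names `split` and `transform_chunk` are hypothetical.

```python
import pandas as pd

# 1. Original DataFrame (hypothetical data).
df = pd.DataFrame({"user": ["a", "b", "c", "d", "e"], "amount": [10, 20, 30, 40, 50]})


def split(frame, chunk_size):
    """2. Split: cut the frame into row chunks of at most chunk_size rows."""
    return [frame.iloc[i:i + chunk_size] for i in range(0, len(frame), chunk_size)]


def transform_chunk(chunk):
    """3. Apply: a transformation that needs only the rows of its own chunk."""
    return chunk.assign(doubled=chunk["amount"] * 2)


chunks = split(df, chunk_size=2)

# 4. Combine: concatenate the transformed chunks back into one DataFrame.
result = pd.concat([transform_chunk(chunk) for chunk in chunks])
print(result)
```

This pattern only works directly for row-wise transformations. Operations that need to see all rows at once, such as a global sort or a join, require extra coordination between chunks.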