Explore fundamentals of data analytics: Flashcards

1
Q

What is modern data warehousing?

A

Modern data warehousing architecture can vary, as can the specific technologies used to implement it, but in general it includes the following elements:

2
Q

Data ingestion and processing

A

Data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse. The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis. In ETL processes, the data is transformed before being loaded into an analytical store, while in an ELT process the data is copied to the store and then transformed. Either way, the resulting data structure is optimized for analytical queries. The data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters. Data ingestion includes both batch processing of static data and real-time processing of streaming data.
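The ETL/ELT distinction above can be sketched in a few lines of Python. This is an illustrative sketch only: the records and the in-memory "store" are hypothetical stand-ins for a real source and analytical store.

```python
# Minimal sketch of ETL vs ELT over a hypothetical in-memory "store".

def transform(records):
    """Clean and restructure: drop rows with missing amounts, normalise names."""
    return [
        {"product": r["product"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("amount") is not None
    ]

def etl(source, store):
    """ETL: transform first, then load the cleaned data into the store."""
    store.extend(transform(source))

def elt(source, store):
    """ELT: load the raw data first, then transform it inside the store."""
    store.extend(source)
    store[:] = transform(store)

raw = [{"product": " Widget ", "amount": "9.50"},
       {"product": "Gadget", "amount": None}]

etl_store, elt_store = [], []
etl(raw, etl_store)
elt(raw, elt_store)
assert etl_store == elt_store  # same result, different order of operations
```

Either way, the store ends up holding the same analysis-ready structure; the difference is only where and when the transformation happens.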

3
Q

Analytical data store

A

Data stores for large-scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases).

4
Q

Analytical data model

A

While data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce reports, dashboards, and interactive visualizations. Often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions (for example, to determine total sales by product and region).
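The cube idea can be illustrated with plain Python: aggregating a numeric measure across two dimensions, as in the total-sales-by-product-and-region example. The sales records below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact records; the "cube" aggregates sales across two dimensions.
sales = [
    {"product": "bike", "region": "north", "amount": 100},
    {"product": "bike", "region": "south", "amount": 150},
    {"product": "bike", "region": "north", "amount": 50},
    {"product": "helmet", "region": "north", "amount": 20},
]

cube = defaultdict(float)
for row in sales:
    # Total sales by (product, region): each dimension pair is one cell.
    cube[(row["product"], row["region"])] += row["amount"]

assert cube[("bike", "north")] == 150
assert cube[("helmet", "north")] == 20
```

A real analytical model pre-computes these aggregates so a report can answer "total sales by product and region" without scanning every fact row.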

5
Q

Data visualization

A

Data analysts consume data from analytical models, and directly from analytical stores, to create reports, dashboards, and other visualizations. Additionally, users in an organization who may not be technology professionals might perform self-service data analysis and reporting. The visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization, and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually.

6
Q

Data ingestion and processing pipelines

A

On Azure, large-scale data ingestion is best implemented by creating pipelines that orchestrate ETL processes. You can create and run pipelines using Azure Data Factory, or you can use the same pipeline engine in Azure Synapse Analytics if you want to manage all of the components of your data warehousing solution in a unified workspace.

In either case, pipelines consist of one or more activities that operate on data. An input dataset provides the source data, and activities can be defined as a data flow that incrementally manipulates the data until an output dataset is produced. Pipelines use linked services to load and process data, enabling you to use the right technology for each step of the workflow. For example, you might use an Azure Blob Storage linked service to ingest the input dataset, and then use services such as Azure SQL Database to run a stored procedure that looks up related data values, before running a data processing task on Azure Databricks or Azure HDInsight, or applying custom logic using an Azure Function. Finally, you can save the output dataset in a linked service such as Azure Synapse Analytics. Pipelines can also include some built-in activities, which don’t require a linked service.
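The activity-chain idea can be sketched as a list of functions, each feeding its output to the next. The activity names are hypothetical stand-ins; this is not the Azure Data Factory API, where pipelines are defined declaratively and run against linked services.

```python
# Hedged sketch: an input dataset flows through an ordered list of
# activities, each incrementally manipulating the data until an output
# dataset is produced. All names here are illustrative.

def ingest(lines):
    # Stands in for a storage linked service reading the input dataset.
    return [line.split(",") for line in lines]

def lookup(rows):
    # Stands in for a stored procedure that looks up related data values.
    prices = {"bike": 100, "helmet": 20}
    return [(sku, qty, prices[sku]) for sku, qty in rows]

def compute(rows):
    # Stands in for a downstream data processing task.
    return [{"sku": sku, "total": int(qty) * price} for sku, qty, price in rows]

def run_pipeline(activities, dataset):
    for activity in activities:   # each activity's output feeds the next
        dataset = activity(dataset)
    return dataset

output = run_pipeline([ingest, lookup, compute], ["bike,2", "helmet,3"])
assert output == [{"sku": "bike", "total": 200}, {"sku": "helmet", "total": 60}]
```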

7
Q

Analytical data stores

A

There are two common types of analytical data store.

Data warehouses
A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads.

Data lakes
A data lake is a file store, usually on a distributed file system for high performance data access. Technologies like Spark or Hadoop are often used to process queries on the stored files and return data for reporting and analytics. Data lakes are great for supporting a mix of structured, semi-structured, and even unstructured data that you want to analyze without the need for schema enforcement when the data is written to the store.

Hybrid approaches
You can use a hybrid approach that combines features of data lakes and data warehouses in a lake database or data lakehouse. The raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL. SQL pools in Azure Synapse Analytics include PolyBase, which enables you to define external tables based on files in a data lake (and other sources) and query them using SQL.
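To illustrate the "SQL over files" idea, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the relational layer. The CSV content is invented, and real lake databases expose file-backed external tables through engines such as PolyBase rather than copying rows into SQLite.

```python
import csv
import io
import sqlite3

# Illustrative only: sqlite3 stands in for the relational layer that
# exposes lake files as SQL-queryable tables.
raw_file = "product,region,amount\nbike,north,100\nbike,south,150\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
for row in csv.DictReader(io.StringIO(raw_file)):
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)",
                 (row["product"], row["region"], float(row["amount"])))

# The file-shaped data can now be queried with ordinary SQL.
total, = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE product = 'bike'").fetchone()
assert total == 250.0
```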

8
Q

Azure Synapse Analytics

A

A unified, end-to-end solution for large-scale data analytics. It brings together multiple technologies, enabling you to combine the data integrity and reliability of a scalable, high-performance SQL Server-based relational data warehouse with the flexibility of a data lake and open-source Apache Spark. All Azure Synapse Analytics services can be managed through a single, interactive user interface called Azure Synapse Studio. Synapse Analytics is a great choice when you want to create a single, unified analytics solution on Azure.

9
Q

Azure Databricks

A

An Azure implementation of the popular Databricks platform. Databricks is a comprehensive data analytics solution built on Apache Spark, and offers native SQL capabilities as well as workload-optimized Spark clusters for data analytics and data science. Due to its common use on multiple cloud platforms, you might want to consider Azure Databricks as your analytical store if you want to use existing expertise with the platform, or if you need to operate in a multi-cloud environment or support a cloud-portable solution.

10
Q

Azure HDInsight

A

An Azure service that supports multiple open-source data analytics cluster types. It's not as user-friendly as Synapse or Databricks, but it is a suitable option if your analytics solution relies on multiple open-source frameworks, or if you need to migrate an existing on-premises Hadoop-based solution to the cloud.

11
Q

Batch Processing vs Stream processing

A

–Batch processing, in which multiple data records are collected and stored before being processed together in a single operation.

–Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur.

12
Q

Batch processing

A

In batch processing, newly arriving data elements are collected and stored, and the whole group is processed together as a batch
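The collect-then-process pattern can be sketched in a few lines. The size threshold and the sum are hypothetical; real batch jobs are typically triggered by a schedule or by data volume.

```python
# Minimal sketch: records accumulate in a buffer and the whole group is
# processed together as one batch (here, when a size threshold is reached).
processed_batches = []

def process_batch(batch):
    processed_batches.append(sum(batch))  # one operation over the whole group

buffer, batch_size = [], 3
for record in [5, 10, 15, 20, 25, 30]:
    buffer.append(record)                 # newly arriving elements are stored...
    if len(buffer) == batch_size:
        process_batch(buffer)             # ...then processed together as a batch
        buffer = []

assert processed_batches == [30, 75]
```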

13
Q

Advantages of batch processing

A

–Large volumes of data can be processed at a convenient time.

–It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.

14
Q

Disadvantages of batch processing

A

–The time delay between ingesting the data and getting the results.

–All of a batch job’s input data must be ready before the batch can be processed, so it must be carefully checked beforehand. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt, and the input data must be checked again before the job can be rerun. Even minor data errors can prevent a batch job from running.

15
Q

Understand stream processing

A

In stream processing, each new piece of data is processed when it arrives. Unlike batch processing, there’s no waiting until the next batch processing interval - data is processed as individual units in real time rather than a batch at a time. Stream data processing is beneficial in scenarios where new, dynamic data is generated on a continual basis.

Real world examples

–A financial institution tracks changes in the stock market in real time, computes value-at-risk, and automatically rebalances portfolios based on stock price movements.

–An online gaming company collects real-time data about player-game interactions, and feeds the data into its gaming platform. It then analyzes the data in real time, offers incentives and dynamic experiences to engage its players.
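The per-event model above can be sketched as a handler invoked once per arriving record. The price-alert logic is a hypothetical stand-in for the stock-tracking example.

```python
# Sketch of per-event processing: each piece of data is handled the
# moment it arrives, rather than waiting for a batch interval.
alerts = []

def on_event(price):
    if price > 100:            # processed immediately, one record at a time
        alerts.append(price)

for price in [95, 101, 99, 120]:   # events arriving over time
    on_event(price)

assert alerts == [101, 120]
```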

16
Q

Understand differences between batch and streaming data

A

Data scope: Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or within a rolling time window (the last 30 seconds, for example).

Data size: Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.

Performance: Latency is the time taken for the data to be received and processed. The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds.

Analysis: You typically use batch processing to perform complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
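The rolling time window mentioned under "Data scope" can be sketched as follows; the 30-second width and the averaging are illustrative, and timestamps are plain seconds rather than real clock times.

```python
from collections import deque

# Sketch of a rolling 30-second window: stream processing only keeps the
# most recent data, evicting events that fall outside the window.
window = deque()  # (timestamp, value) pairs

def rolling_average(ts, value, width=30):
    window.append((ts, value))
    while window and window[0][0] <= ts - width:   # drop expired events
        window.popleft()
    return sum(v for _, v in window) / len(window)

assert rolling_average(0, 10) == 10
assert rolling_average(20, 30) == 20    # both events inside the window
assert rolling_average(40, 50) == 40    # the t=0 event has expired
```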

17
Q

Combine batch and stream processing

A

Many large-scale analytics solutions include a mix of batch and stream processing, enabling both historical and real-time data analysis. It’s common for stream processing solutions to capture real-time data, process it by filtering or aggregating it, and present it through real-time dashboards and visualizations

  1. Data events from a streaming data source are captured in real-time.
  2. Data from other sources is ingested into a data store (often a data lake) for batch processing.
  3. If real-time analytics is not required, the captured streaming data is written to the data store for subsequent batch processing.
  4. When real-time analytics is required, a stream processing technology is used to prepare the streaming data for real-time analysis or visualization; often by filtering or aggregating the data over temporal windows.
  5. The non-streaming data is periodically batch processed to prepare it for analysis, and the results are persisted in an analytical data store (often referred to as a data warehouse) for historical analysis.
  6. The results of stream processing may also be persisted in the analytical data store to support historical analysis.
  7. Analytical and visualization tools are used to present and explore the real-time and historical data.
18
Q

A general architecture for stream processing

A
  1. An event generates some data. This might be a signal being emitted by a sensor, a social media message being posted, a log file entry being written, or any other occurrence that results in some digital data.
  2. The generated data is captured in a streaming source for processing. In simple cases, the source may be a folder in a cloud data store or a table in a database. In more robust streaming solutions, the source may be a “queue” that encapsulates logic to ensure that event data is processed in order and that each event is processed only once.
  3. The event data is processed, often by a perpetual query that operates on the event data to select data for specific types of events, project data values, or aggregate data values over temporal (time-based) periods (or windows) - for example, by counting the number of sensor emissions per minute.
  4. The results of the stream processing operation are written to an output (or sink), which may be a file, a database table, a real-time visual dashboard, or another queue for further processing by a subsequent downstream query.
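The "perpetual query" in step 3 can be illustrated with the sensor example: grouping event timestamps into tumbling one-minute windows and counting emissions per window. The event times are invented, and a real streaming engine would evaluate this continuously rather than over a fixed list.

```python
from collections import Counter

# Counting sensor emissions per minute: each event's window index is its
# timestamp (in seconds) divided by the 60-second window width.
events = [3, 42, 59, 61, 118, 125]   # hypothetical event times in seconds

counts = Counter(t // 60 for t in events)

assert counts[0] == 3   # three emissions in minute 0
assert counts[1] == 2   # two emissions in minute 1
assert counts[2] == 1   # one emission in minute 2
```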
19
Q

Real-time analytics in Azure

A