Core Data Concepts Flashcards

1
Q

What is Data?

A

Data is a collection of facts such as numbers, descriptions, and observations used to record information.

2
Q

Structured Data

A

Structured data is data that adheres to a fixed schema, so all of the data has the same fields or properties. The data is represented in one or more tables that consist of rows to represent each instance of a data entity, and columns to represent attributes of the entity.
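A minimal sketch (not from the source): in Python, a fixed schema can be modeled with a dataclass, so every instance carries exactly the same fields, like rows sharing the same columns. The Customer entity and its fields are hypothetical.

```python
from dataclasses import dataclass

# Fixed schema: every Customer has exactly these fields (columns).
@dataclass
class Customer:
    id: int
    name: str
    email: str

# Each instance is one row in the conceptual table.
rows = [
    Customer(1, "Ana", "ana@example.com"),
    Customer(2, "Ben", "ben@example.com"),
]
```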

3
Q

Semi-structured data

A

Semi-structured data is information that has some structure, but which allows for some variation between entity instances. For example, while most customers may have an email address, some might have multiple email addresses, and some might have none at all.
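A minimal sketch of this variation, using hypothetical customer records as Python dictionaries; note how consuming code has to tolerate the missing or repeated email field.

```python
# Same entity type, but the email field varies between instances.
customers = [
    {"id": 1, "name": "Ana", "email": ["ana@example.com"]},
    {"id": 2, "name": "Ben", "email": ["ben@work.example", "ben@home.example"]},
    {"id": 3, "name": "Chen"},  # no email address at all
]

for c in customers:
    print(c["name"], c.get("email", []))  # tolerate the missing field
```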

4
Q

Unstructured data

A

Not all data is structured or even semi-structured. For example, documents, images, audio and video data, and binary files might not have a specific structure. This kind of data is referred to as unstructured data.

5
Q

Data stores

A

There are two broad categories of data store in common use:

  • File stores
  • Databases
6
Q

Delimited text files

A

Data is often stored in plain text format with specific field delimiters and row terminators. The most common format for delimited data is comma-separated values (CSV), in which fields are separated by commas and rows are terminated by a new line. Another popular format is tab-separated values (TSV).
Delimited text is a good choice for structured data that needs to be accessed by a wide range of applications and services in a human-readable format.
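As a sketch using Python's standard csv module (the data values are hypothetical):

```python
import csv
import io

# Hypothetical delimited data: fields separated by commas, rows by new lines.
data = "id,name,email\n1,Ana,ana@example.com\n2,Ben,ben@example.com\n"

# csv.DictReader maps each row to the field names in the header row.
for row in csv.DictReader(io.StringIO(data)):
    print(row["id"], row["name"], row["email"])

# For TSV, the same reader works with delimiter="\t".
```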

7
Q

JavaScript Object Notation (JSON)

A

JSON is a ubiquitous format in which a hierarchical document schema is used to define data entities (objects) that have multiple attributes. Each attribute might itself be an object (or a collection of objects), making JSON a flexible format that’s good for both structured and semi-structured data.
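A sketch using Python's standard json module; the customer document and its attributes are hypothetical:

```python
import json

# A hypothetical customer document: attributes may be nested objects or lists.
doc = """
{
  "id": 1,
  "name": "Ana",
  "contact": {"email": ["ana@example.com"], "phone": null}
}
"""

customer = json.loads(doc)
print(customer["contact"]["email"][0])  # navigate the hierarchy
```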

8
Q

Extensible Markup Language (XML)

A

XML is a human-readable data format that was popular in the 1990s and 2000s. It’s largely been superseded by the less verbose JSON format, but there are still some systems that use XML to represent data.
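The same kind of hypothetical customer record, parsed with Python's standard xml.etree module as a sketch:

```python
import xml.etree.ElementTree as ET

# The more verbose XML form of a hypothetical customer record.
doc = "<customer><id>1</id><name>Ana</name></customer>"
root = ET.fromstring(doc)
print(root.find("name").text)  # -> Ana
```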

9
Q

Binary Large Object (BLOB)

A

Ultimately, all files are stored as binary data (1s and 0s), but in the human-readable formats discussed above, the bytes of binary data are mapped to printable characters (typically through a character encoding scheme such as ASCII or Unicode).

Common types of data stored as binary include images, video, audio, and application-specific documents.

When working with data like this, data professionals often refer to the data files as BLOBs (Binary Large Objects).
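A small sketch of the distinction in Python: encoded text maps to printable characters, while a BLOB is raw bytes with no such mapping (here, the first eight bytes of a PNG file):

```python
# Human-readable formats map bytes to printable characters via an encoding...
text = "data"
encoded = text.encode("utf-8")
print(encoded)        # b'data'
print(encoded.hex())  # 64617461 - the underlying binary

# ...whereas a BLOB is raw bytes with no character mapping, e.g. a PNG file's
# first eight bytes (its magic number):
png_header = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])
print(png_header.hex())
```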

10
Q

Optimized file formats

A

Human-readable formats are not optimized for storage and processing, so specialized file formats have been developed that enable compression, indexing, and efficient storage and processing.

Avro – a row-based format created by Apache. Each record contains a header, stored as JSON, that describes the structure of the record; the data itself is stored as binary. Avro is a good file format for compressing data and minimizing storage and network bandwidth requirements.

ORC – the Optimized Row Columnar format organizes data into columns rather than rows. It was developed by Hortonworks to optimize read and write operations in Apache Hive (a data warehouse system that supports fast data summarization and querying over large datasets). An ORC file contains stripes of data; each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.

Parquet – another columnar data format, created by Cloudera and Twitter. A Parquet file contains row groups; data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk, which an application can use to quickly locate the correct chunk for a given set of rows and retrieve the data in the specified columns for those rows. Parquet specializes in storing and processing nested data types efficiently, and it supports very efficient compression and encoding schemes.
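As a sketch of columnar access; the pyarrow library and the sample table are assumptions, not named in the source:

```python
# Requires the pyarrow package (pip install pyarrow).
import pyarrow as pa
import pyarrow.parquet as pq

# A small hypothetical table; Parquet stores each column's data together.
table = pa.table({"product": ["a", "b", "a"], "sales": [10, 20, 15]})
pq.write_table(table, "sales.parquet")

# Columnar layout lets a reader fetch only the columns it needs.
print(pq.read_table("sales.parquet", columns=["sales"]))
```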

11
Q

Relational databases

A

Relational databases are commonly used to store and query structured data. The data is stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it, and these keys are used to reference the entity instance in other tables. This use of keys to reference data entities enables a relational database to be normalized, which in part means eliminating duplicate data values so that, for example, the details of an individual customer are stored only once, not for each sales order the customer places.
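A minimal normalized sketch using Python's built-in sqlite3 module; the customer and sales_order tables and their values are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,     -- uniquely identifies each customer
    name TEXT,
    email TEXT
);
CREATE TABLE sales_order (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer (id),  -- key reference, not a copy
    amount REAL
);
INSERT INTO customer VALUES (1, 'Ana', 'ana@example.com');
INSERT INTO sales_order VALUES (101, 1, 25.0), (102, 1, 40.0);
""")

# The customer's details are stored once and joined in when needed.
for row in con.execute("""
    SELECT c.name, o.amount
    FROM sales_order AS o JOIN customer AS c ON c.id = o.customer_id
"""):
    print(row)
```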

12
Q

Non-relational databases

A

Non-relational databases are data management systems that don’t apply a relational schema to the data. Non-relational databases are often referred to as NoSQL databases, even though some support a variant of the SQL language.

There are four types of non-relational database in common use (the first two are sketched in code after the list):

  • Key-value databases in which each record consists of a unique key and an associated value, which can be in any format.
  • Document databases, which are a specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query).
  • Column family databases, which store tabular data comprising rows and columns, but the columns can be divided into groups known as column families. Each column family holds a set of columns that are logically related.
  • Graph databases, which store entities as nodes with links to define relationships between them.
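A minimal sketch of the first two types, using plain Python dictionaries to stand in for the stores (the keys and values are hypothetical):

```python
import json

# Key-value: each record is a unique key plus an opaque value in any format.
kv_store = {
    "cust:1": b'{"name": "Ana"}',   # the store doesn't interpret the bytes
    "img:42": bytes([0x89, 0x50]),  # could just as well be binary
}

# Document: the value is a JSON document the system can parse and query.
doc_store = {
    "cust:1": json.loads('{"name": "Ana", "email": ["ana@example.com"]}'),
}
print(doc_store["cust:1"]["name"])
```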
13
Q

Transactional data workloads

A

A transactional data processing system is what most people consider the primary function of business computing. A transactional system records transactions that encapsulate specific events that the organization wants to track. A transaction could be financial, such as the movement of money between accounts in a banking system.

Transactional systems are often high-volume, sometimes handling many millions of transactions in a single day. The data being processed has to be accessible very quickly. The work performed by transactional systems is often referred to as Online Transactional Processing (OLTP).

14
Q

Online Transactional Processing (OLTP).

A

OLTP solutions rely on a database system in which data storage is optimized for both read and write operations in order to support transactional workloads in which data records are created, retrieved, updated, and deleted (often referred to as CRUD operations). These operations are applied transactionally, in a way that ensures the integrity of the data stored in the database. To accomplish this, OLTP systems enforce transactions that support so-called ACID semantics:

– Atomicity, Consistency, Isolation, and Durability – each defined in the following cards.

15
Q

Atomicity

A

– Atomicity – each transaction is treated as a single unit, which succeeds completely or fails completely. For example, a transaction that involves debiting funds from one account and crediting the same amount to another account must complete both actions. If either action can’t be completed, the other action must fail.
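A minimal sketch of atomicity using Python's built-in sqlite3 module, whose connection context manager commits on success and rolls back on failure; the account table and amounts are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
con.commit()

try:
    with con:  # one transaction: commits only if the whole block succeeds
        con.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
        # ...the matching credit would follow; simulate a failure first:
        raise RuntimeError("crash between debit and credit")
except RuntimeError:
    pass

# The debit was rolled back with the failed transaction: balances unchanged.
print(list(con.execute("SELECT id, balance FROM account")))
```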

16
Q

Consistency

A

Consistency – transactions can only take the data in the database from one valid state to another. To continue the debit and credit example above, the completed state of the transaction must reflect the transfer of funds from one account to the other.

17
Q

Isolation

A

Isolation – concurrent transactions cannot interfere with one another, and must result in a consistent database state. For example, while the transaction to transfer funds from one account to another is in-process, another transaction that checks the balance of these accounts must return consistent results - the balance-checking transaction can’t retrieve a value for one account that reflects the balance before the transfer, and a value for the other account that reflects the balance after the transfer.

18
Q

Durability

A

Durability – when a transaction has been committed, it will remain committed. After the account transfer transaction has completed, the revised account balances are persisted so that even if the database system were to be switched off, the committed transaction would be reflected when it is switched on again.

19
Q

Analytical data workloads

A

Analytical data processing typically uses read-only (or read-mostly) systems that store vast volumes of historical data or business metrics. Analytics can be based on a snapshot of the data at a given point in time, or a series of snapshots.

20
Q

Architecture for enterprise-scale analytics

A

The specific details for an analytical processing system can vary between solutions, but a common architecture for enterprise-scale analytics looks like this:

  1. Data files may be stored in a central data lake for analysis.
  2. An extract, transform, and load (ETL) process copies data from files and OLTP databases into a data warehouse that is optimized for read activity. Commonly, a data warehouse schema is based on fact tables that contain numeric values you want to analyze (for example, sales amounts), with related dimension tables that represent the entities by which you want to measure them (for example, customer or product); a minimal fact/dimension sketch in code follows this list.
  3. Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
  4. The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
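A minimal fact/dimension sketch using Python's built-in sqlite3 module; the tables and values are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product (product_id),
    sale_date TEXT,
    amount REAL                      -- the numeric measure to analyze
);
INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO fact_sales VALUES
    (1, '2024-01-01', 10.0), (1, '2024-01-02', 15.0), (2, '2024-01-01', 20.0);
""")

# Aggregate the measure in the fact table by a dimension attribute.
for row in con.execute("""
    SELECT p.name, SUM(f.amount) AS total_sales
    FROM fact_sales AS f JOIN dim_product AS p USING (product_id)
    GROUP BY p.name
"""):
    print(row)
```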
21
Q

Data lakes

A

Data lakes are common in modern data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.

22
Q

Data warehouses

A

Data warehouses are an established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization. The data warehouse schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).

23
Q

OLAP

A

An OLAP model is an aggregated type of data storage that is optimized for analytical workloads. Data is aggregated across dimensions at different levels, enabling you to drill up or down to view aggregations at multiple hierarchical levels; for example, to find total sales by region, by city, or for an individual address. Because OLAP data is pre-aggregated, queries to return the summaries it contains can run quickly.
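A rough sketch of drill up/down over pre-aggregated data, using the pandas library (an assumption; the sales data is hypothetical):

```python
# Requires the pandas package (pip install pandas).
import pandas as pd

sales = pd.DataFrame({
    "region": ["West", "West", "East"],
    "city":   ["Seattle", "Portland", "Boston"],
    "amount": [100, 80, 120],
})

# Pre-aggregate once at the finest level of the hierarchy...
cube = sales.groupby(["region", "city"])["amount"].sum()
print(cube)  # drill down: totals per city

# ...then roll the stored aggregates up to a coarser level.
print(cube.groupby(level="region").sum())  # drill up: totals per region
```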

24
Q

Data Scientists

A

Data scientists might work directly with data files in a data lake to explore and model data.

25
Q

Data Analysts

A

Data Analysts might query tables directly in the data warehouse to produce complex reports and visualizations.

26
Q

Business users

A

Business users might consume pre-aggregated data in an analytical model in the form of reports or dashboards.

27
Q

Data Professionals Roles

A

There’s a wide variety of roles involved in managing, controlling, and using data. Some roles are business-oriented, some involve more engineering, some focus on research, and some are hybrid roles that combine different aspects of data management.

28
Q

Database administrators

A

Database administrators manage databases, assign permissions to users, store backup copies of data, and restore data in the event of a failure. A database administrator is responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database systems. They’re responsible for the overall availability, consistent performance, and optimization of databases. They work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error.

The database administrator is also responsible for managing the security of the data in the database, granting privileges over the data, granting or denying access to users as appropriate.

29
Q

Data engineers

A

A data engineer collaborates with stakeholders to design and implement data-related workloads, including data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and non-relational databases, file stores, and data streams.

They’re also responsible for ensuring that data privacy is maintained, both in the cloud and across data stores that span on-premises and cloud systems. They own the management and monitoring of data pipelines to ensure that data loads perform as expected.

30
Q

Data analysts

A

A data analyst enables businesses to maximize the value of their data assets. They’re responsible for exploring data to identify trends and relationships, designing and building analytical models, and enabling advanced analytics capabilities through reports and visualizations.

A data analyst processes raw data into relevant insights based on identified business requirements.

31
Q

Microsoft cloud services for data

A

Microsoft Azure is a cloud platform that powers the applications and IT infrastructure for some of the world’s largest organizations. It includes many services to support cloud solutions, including transactional and analytical data workloads.

32
Q

Azure SQL

A

Azure SQL is the collective name for a family of relational database solutions based on the Microsoft SQL Server database engine. Specific Azure SQL services include:

– Azure SQL Database – a fully managed platform-as-a-service (PaaS) database hosted in Azure

– Azure SQL Managed Instance – a hosted instance of SQL Server with automated maintenance, which allows more flexible configuration than Azure SQL DB but with more administrative responsibility for the owner.

– Azure SQL VM – a virtual machine with an installation of SQL Server, allowing maximum configurability with full management responsibility

33
Q

Azure Database for open-source relational databases

A

– Azure Database for MySQL - a simple-to-use open-source database management system that is commonly used in Linux, Apache, MySQL, and PHP (LAMP) stack apps.

– Azure Database for MariaDB - a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system).

– Azure Database for PostgreSQL - a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties.

34
Q

Azure Cosmos DB

A

Azure Cosmos DB is a global-scale non-relational (NoSQL) database system that supports multiple application programming interfaces (APIs), enabling you to store and manage data as JSON documents, key-value pairs, column-families, and graphs.

35
Q

Azure storage

A

Azure Storage is a core Azure service that enables you to store data in:

– Blob containers - scalable, cost-effective storage for binary files.

– File shares - network file shares such as you typically find in corporate networks.

– Tables - key-value storage for applications that need to read and write data values quickly.

Data engineers use Azure Storage to host data lakes - blob storage with a hierarchical namespace that enables files to be organized in folders in a distributed file system.

36
Q

Azure Data Factory

A

Azure Data Factory is an Azure service that enables you to define and schedule data pipelines to transfer and transform data. You can integrate your pipelines with other Azure services, enabling you to ingest data from cloud data stores, process the data using cloud-based compute, and persist the results in another data store.

Azure Data Factory is used by data engineers to build extract, transform, and load (ETL) solutions that populate analytical data stores with data from transactional systems across the organization.

37
Q

Azure Synapse Analytics

A

Azure Synapse Analytics is a comprehensive, unified data analytics solution that provides a single service interface for multiple analytical capabilities, including:

– Pipelines - based on the same technology as Azure Data Factory.

– SQL - a highly scalable SQL database engine, optimized for data warehouse workloads.

– Apache Spark - an open-source distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.

– Azure Synapse Data Explorer - a high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL).

Data engineers can use Azure Synapse Analytics to create a unified data analytics solution that combines data ingestion pipelines, data warehouse storage, and data lake storage through a single service.

Data analysts can use SQL and Spark pools through interactive notebooks to explore and analyze data, and take advantage of integration with services such as Azure Machine Learning and Microsoft Power BI to create data models and extract insights from the data.

38
Q

Azure Databricks

A

Azure Databricks is an Azure-integrated version of the popular Databricks platform, which combines the Apache Spark data processing platform with SQL database semantics and an integrated management interface to enable large-scale data analytics.

Data engineers can use existing Databricks and Spark skills to create analytical data stores in Azure Databricks.

Data analysts can use the native notebook support in Azure Databricks to query and visualize data in an easy-to-use web-based interface.

39
Q

Azure HDInsight

A

Azure HDInsight is an Azure service that provides Azure-hosted clusters for popular Apache open-source big data processing technologies, including:

– Apache Spark - a distributed data processing system that supports multiple programming languages and APIs, including Java, Scala, Python, and SQL.

– Apache Hadoop - a distributed system that uses MapReduce jobs to process large volumes of data efficiently across multiple cluster nodes. MapReduce jobs can be written in Java or abstracted by interfaces such as Apache Hive - a SQL-based API that runs on Hadoop.

– Apache HBase - an open-source system for large-scale NoSQL data storage and querying.

– Apache Kafka - a message broker for data stream processing.

– Apache Storm - an open-source system for real-time data processing through a topology of spouts and bolts.

Data engineers can use Azure HDInsight to support big data analytics workloads that depend on multiple open-source technologies.

40
Q

Azure Stream Analytics

A

Stream Analytics is a real-time stream processing engine that captures a stream of data from an input, applies a query to extract and manipulate data from the input stream, and writes the results to an output for analysis or further processing.

Data engineers can incorporate Azure Stream Analytics into data analytics architectures that capture streaming data for ingestion into an analytical data store or for real-time visualization.

41
Q

Azure Data Explorer

A

Azure Data Explorer is a standalone service that offers the same high-performance querying of log and telemetry data as the Azure Synapse Data Explorer runtime in Azure Synapse Analytics.

Data analysts can use Azure Data Explorer to query and analyze data that includes a timestamp attribute, such as is typically found in log files and Internet-of-things (IoT) telemetry data.

42
Q

Azure Purview

A

Azure Purview provides a solution for enterprise-wide data governance and discoverability. You can use Azure Purview to create a map of your data and track data lineage across multiple data sources and systems, enabling you to find trustworthy data for analysis and reporting.

Data engineers can use Azure Purview to enforce data governance across the enterprise and ensure the integrity of data used to support analytical workloads.

43
Q

Microsoft Power BI

A

Microsoft Power BI is a platform for analytical data modeling and reporting that data analysts can use to create and share interactive data visualizations. Power BI reports can be created by using the Power BI Desktop application, and then published and delivered through web-based reports and apps in the Power BI service, as well as in the Power BI mobile app.