Extra Flashcards
Data Overview
-Data = Units of information
-Data Documents = Collective form in which data exists (Datasets, Databases, Datastores, Data Warehouses)
-Data Sets = Logical groupings of units of data that generally are closely related or share the same data structure
-Data Types = How single units of data are intended to be used
-Batch and Streaming Data = How do we move our data around?
-Relational and Non Relational = How do we access, query and search our data?
-Data Modelling = How do we prepare and design our data?
-Schemas and Schemaless = How do we structure our data for search?
-Data Integrity and Data Corruption = How do we trust our data?
-Normalized and Denormalized = How do we trade quality vs speed?
Schema
Schema = A formal language which describes the structure of the data in a database.
In a relational database, a schema can define many different data structures that serve different purposes:
-Tables, Fields, Relationships, Views, Indexes, Packages, Procedures, Functions, XML schemas, Queues, Triggers, Sequences, etc.
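A minimal T-SQL sketch of a schema (hypothetical table and column names), showing how tables, fields and data types are declared up front:
CREATE TABLE dbo.Product (
    ProductId   INT            NOT NULL PRIMARY KEY,      -- field with a fixed data type
    ProductName NVARCHAR(100)  NOT NULL,
    UnitPrice   DECIMAL(10, 2) NOT NULL,
    CreatedAt   DATETIME2      NOT NULL DEFAULT SYSDATETIME()
);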
Schemaless
Schemaless = When the primary “cell” of a database can accept many types.
-This allows developers to forgo upfront data modelling
Common schemaless databases are:
-Key/Value
-Document
-Column (Wide Column)
-Graph
Query & Querying
A query is a request for data results (reads) or to perform operations such as inserting, updating, or deleting data (writes)
Querying is the act of performing a query
Query language is a scripting or programming language designed as the format to submit requests or actions to the database. (SQL, GraphQL, Kusto, Gremlin, etc.)
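A minimal sketch of a read query and a write query in SQL, assuming the hypothetical dbo.Product table sketched above:
-- Read query: a request for data results
SELECT ProductId, ProductName
FROM dbo.Product
WHERE UnitPrice > 10;

-- Write query: an operation that changes data
UPDATE dbo.Product
SET UnitPrice = UnitPrice * 1.05
WHERE ProductId = 42;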
Relational Data
-Tables = A logical grouping of rows and columns
-Views = The result set of a stored query on data, computed in memory when accessed
-Materialized Views = The result set of a stored query, stored (persisted) on disk
-Indexes = A copy of data sorted by one or multiple columns, improving the speed of reads at the cost of extra storage
-Constraints = Rules applied to writes that ensure data integrity
-Triggers = A function that is triggered on specific database events
-Primary Key = One or multiple columns that uniquely identify a row in a table
-Foreign Key = A column which holds the value of a primary key from another table, to establish a relationship
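A minimal T-SQL sketch (hypothetical names) showing a few of these structures together; it assumes the dbo.Product table sketched earlier:
CREATE TABLE dbo.[Order] (
    OrderId   INT  NOT NULL PRIMARY KEY,                         -- primary key
    ProductId INT  NOT NULL REFERENCES dbo.Product (ProductId),  -- foreign key
    Quantity  INT  NOT NULL CHECK (Quantity > 0),                -- constraint on writes
    OrderDate DATE NOT NULL
);

CREATE INDEX IX_Order_OrderDate ON dbo.[Order] (OrderDate);      -- index for faster reads
GO
CREATE VIEW dbo.RecentOrders AS                                   -- view: a stored query
SELECT OrderId, ProductId, Quantity
FROM dbo.[Order]
WHERE OrderDate >= '2024-01-01';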
Row-store vs Column-store
Row-store (RS):
-Data is organized in rows
-Traditional relational databases are row-stores
-Good for general purpose databases
-Suited for Online Transaction Processing (OLTP)
-Great when needing all possible columns in a row
-Not the best at analytics or massive amounts of data
Column-store (CS):
-Data is organized by columns
-Faster at aggregating values for analytics
-Typically NoSQL stores or SQL-like databases
-Great for vast amounts of data
-Suited for Online Analytical Processing (OLAP)
-Great when you only need a few columns
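A minimal T-SQL sketch of a column-store, assuming a hypothetical dbo.SalesFact table (SQL Server and Synapse expose this as a clustered columnstore index):
-- Store the table column-by-column, which speeds up analytical aggregations
CREATE CLUSTERED COLUMNSTORE INDEX cci_SalesFact ON dbo.SalesFact;

-- Column-stores shine when only a few columns are aggregated over many rows
SELECT ProductId, SUM(Amount) AS TotalSales
FROM dbo.SalesFact
GROUP BY ProductId;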
Data Integrity & Data Corruption
Data Integrity ensures data is recorded exactly as intended (data quality)
-Have a well-defined and documented data modelling process
-Logical constraints
-Redundant copies and versions of your data
-Hash functions
Data Corruption is the act or state of data not being in its intended state
-Hardware failure
-Human error
-Malicious actors
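A minimal sketch of using a hash function to detect corruption, assuming a hypothetical dbo.Documents table whose Content column is NVARCHAR:
-- Store a hash alongside the data; recomputing and comparing it later
-- reveals whether Content changed unexpectedly (corruption or tampering)
SELECT DocumentId,
       HASHBYTES('SHA2_256', Content) AS ContentHash
FROM dbo.Documents;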
Normalized vs Denormalized Data
Normalized is a schema designed to store non-redundant and consistent data
-Data integrity is maintained
-Little to no redundant data
-Many tables
-Optimizes for storage of data
Denormalized combines data so that accessing data (querying) is fast
-Data integrity is not maintained
-Redundant data is common
-Fewer tables
-Excessive data, so storage is less optimal
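A minimal T-SQL sketch of the trade-off (hypothetical tables):
-- Normalized: customers and orders are split, so nothing is duplicated
CREATE TABLE dbo.Customer (
    CustomerId   INT PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);
CREATE TABLE dbo.Orders (
    OrderId    INT PRIMARY KEY,
    CustomerId INT NOT NULL REFERENCES dbo.Customer (CustomerId),  -- lookup by join
    OrderDate  DATE NOT NULL
);

-- Denormalized: the customer name is repeated on every order row, so reads need no join
CREATE TABLE dbo.OrdersDenormalized (
    OrderId      INT PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,  -- redundant data
    OrderDate    DATE NOT NULL
);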
Pivot Table
A table of statistics that summarizes the data of a more extensive table, drawn from a database, spreadsheet or Business Intelligence (BI) tool
-A technique in data processing
-Draws attention to useful information
-Leads to finding figures and facts quickly
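A minimal T-SQL sketch of pivoting, assuming a hypothetical dbo.Sales table with one row per (product, year):
-- Turn one row per (product, year) into one summary row per product
SELECT ProductName, [2023] AS Sales2023, [2024] AS Sales2024
FROM (
    SELECT ProductName, SalesYear, Amount
    FROM dbo.Sales
) AS src
PIVOT (
    SUM(Amount) FOR SalesYear IN ([2023], [2024])
) AS p;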
Strongly Consistent vs Eventually Consistent
Data consistency describes whether data kept in two different places exactly matches or not
SC = Every time you request data, you can expect consistent data to be returned within X time (e.g. 1 sec) (never returns old data)
EC = When you request data, you may get back inconsistent data within X time (e.g. 2 secs) (whatever data is currently in the db; you may get new data or old data)
Synchronous vs Asynchronous
Can refer to mechanism for data transmission or data replication
Synchronous = continuous stream of data that is synchronized by a timer or clock
-Can only access data once the transfer is complete
-Guaranteed consistency of the data returned at time of access
-Slower access times
Asynchronous = continuous stream of data separated by start and stop bits (no guarantee of time)
-Can access data anytime but may return older version or empty placeholder
-Faster access times, but no guarantee of consistency
Non Relational Data
A non-relational database stores data in a non-tabular form and will be optimized for different kinds of data-structures
Types of non-relational databases:
-Key/Value = Each value has a key, designed to scale, only simple lookups
-Document = Primary entity is a JSON-like data-structure called a document
-Columnar = Has a table-like structure but data is stored around columns instead of rows
-Graph = Data is represented with nodes and edges. Used where relationships matter.
Data Warehouse
A relational datastore designed for analytic workloads, generally a column-oriented data-store.
Companies will have terabytes and millions of rows of data, and they need a fast way to be able to produce analytics reports.
-They can return queries very very fast even though they have vast amounts of data
-Are infrequently accessed meaning they aren’t intended for real-time reporting
-They need to consume data from a relational database on a regular basis
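A minimal sketch of a typical warehouse (OLAP) query over hypothetical fact and dimension tables:
-- Aggregate a large fact table joined to dimensions (star schema)
SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate    AS d ON f.DateKey    = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, p.Category
ORDER BY d.CalendarYear, TotalSales DESC;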
Data Mart
A data mart is a subset of a data warehouse.
-It will store under 100GB and has a single business focus
-Allows different departments to have control over their own dataset
-Generally designed to be read-only
-Increase the frequency at which data can be accessed
-The cost to query the data is much lower
Data Lakes
A data lake is a centralized storage repository that holds a vast amount of raw data (big data) in either a semi-structured or unstructured format.
Commonly accessed for data workloads such as: Visualizations, Real-time analytics, Machine Learning, On-premise data
Data Concepts
Data Mining is the extraction of patterns and knowledge from large amounts of data (not the extraction of data itself)
Data Wrangling is the process of transforming and mapping data from one “raw” data form into another format.
A Data Model organizes elements of data and standardizes how they relate to one another.
Data Modelling is a process used to define and analyze data requirements needed to support the business processes.
Data Analytics is concerned with examining, transforming, and arranging data so that you can extract and study useful information.
ETL vs ELT
ETL and ELT are used when you want to move data from one location to another where the datastores/databases have different data structures, so you need to transform the data for the target system.
ELT
-Loads data directly into a target system
-Used for scalable cloud structured and unstructured data sources
-Used for large amounts of data
-Provides data lake support
-Requires specialized skills to implement and maintain
-Support for unstructured data readily available
ETL
-Loads data first into a staging server and then into the target system
-Used for on-premises, relational and structured data
-Used for small amounts of data
-Doesn’t provide data lake support
-Easy to implement
-Mostly supports relational data
Data Analytics Techniques
- Descriptive Analytics - What happened?
-Metrics ~ KPI & ROI
-Generating sales and financial reports
-Accurate, comprehensive, live-data and effective visualizations
- Diagnostic Analytics - Why did it happen?
-Investigate descriptive metrics to determine root cause
-Find and isolate anomalies into their own datasets and apply techniques
- Predictive Analytics - What will happen?
-Use historical data to predict trends or recurrence
-Statistical and ML techniques applied
- Prescriptive Analytics - How can we make it happen? What actions should we take?
-Goes a step further than predictive and uses ML by ingesting hybrid data to predict future scenarios that are exploitable
-The result highlights what you can now make happen: prescriptive analytics combines data, mathematical models, and various business rules to infer actions that influence future desired outcomes
- Cognitive Analytics - What if this happens?
-Uses analytics to draw patterns, create what-if scenarios, and determine what actions can be taken if those scenarios become reality
-A type of analytics that uses machine learning algorithms to analyze unstructured data such as text, images, and speech
Azure Synapse Analytics
Synapse Analytics is a data warehouse and unified analytics platform.
-Build ETL/ELT processes = ingest data from more than 95 native connectors
-Integrated Apache Spark
-Use T-SQL queries on both your data warehouse and Spark engines
-Supports multiple languages = T-SQL, Python, Scala, Spark SQL, and .Net
-Integrated with Artificial Intelligence (AI) and Business Intelligence (BI) tools = Azure ML, Cognitive Services and Power BI
Synapse SQL and pools
Synapse SQL is a distributed version of T-SQL designed for data warehouse workloads
-Extends T-SQL to address streaming and machine learning scenarios
-Use built-in streaming capabilities to land data from cloud data sources into SQL tables
-Integrate AI with SQL by using ML models to score data using the T-SQL PREDICT function
-Offers:
Serverless = For unpredictable workloads, use the always-available, serverless SQL endpoint
-Serverless SQL pool is a query service over the data in your data lake
Predictable = Create dedicated SQL pools to reserve processing power for data stored in SQL tables
-Dedicated SQL pool is a query service over the data in your data warehouse. The unit of scale is an abstraction of compute power known as a Data Warehouse Unit (DWU)
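A minimal sketch of querying data-lake files from a serverless SQL pool with OPENROWSET (hypothetical storage path):
-- Query Parquet files in the data lake without loading them into tables first
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/files/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS src;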
Synapse Key Features
Apache Spark:
-Synapse deeply integrates with Apache Spark.
-Simplified resource model that frees you from having to worry about managing clusters
-Fast Spark start-up and aggressive autoscaling
Data Lake:
-Tables defined on files in the data lake are seamlessly consumed by either Spark or Hive
-SQL and Spark can directly explore and analyze Parquet, CSV, TSV, and JSON files stored in the data lake
-Fast, scalable data loading between SQL and Spark databases
Azure Data Lake (Gen2)
A data lake is a centralized data repository for unstructured and semi-structured data
-Intended to store vast amounts of data
-Generally use objects (blobs) or files as its storage medium
-Designed to handle petabytes of data and hundreds of gigabits of throughput
-Data Lake Storage adds a “hierarchical namespace” to Blob Storage
-Collect: Pulling from various data sources
-Transform: Change or blend data into new semi-structured data using ETL/ELT engines
-Publish: Publish dataset to meta catalogs so analysts can quickly find useful data
-Distribution: Allow access to data to various programs or APIs
PolyBase
PolyBase is a data virtualization feature for SQL Server.
Enables your SQL Server instance to query data with T-SQL directly from:
-SQL Server
-Oracle
-Teradata
-MongoDB
-Hadoop clusters
-Cosmos DB
without separately installing client connection software.
-Allows you to join data from a SQL Server instance with external data
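A minimal T-SQL sketch of PolyBase external tables (hypothetical names and paths; credentials and exact options vary by SQL Server / Synapse version and are omitted here):
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (LOCATION = 'abfss://files@mydatalake.dfs.core.windows.net', TYPE = HADOOP);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

CREATE EXTERNAL TABLE dbo.ExternalSales (
    OrderId INT,
    Amount  DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyDataLake, FILE_FORMAT = CsvFormat);

-- The external data can now be joined with local tables using plain T-SQL
SELECT TOP 10 * FROM dbo.ExternalSales;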
Azure Synapse Analytics - ELT
You can perform ELT using Synapse SQL in Azure Synapse.
-The fastest and most scalable way to load data is through PolyBase external tables and the COPY statement
-With PolyBase and the COPY statement, you can access external data stored in Azure Blob storage or Azure Data Lake Storage via the T-SQL language
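A minimal sketch of the COPY statement in a dedicated SQL pool (hypothetical table and storage path; authentication options omitted):
COPY INTO dbo.StagingSales
FROM 'https://mydatalake.blob.core.windows.net/files/sales/*.parquet'
WITH (FILE_TYPE = 'PARQUET');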