Data Analysis Tools Flashcards

1
Q

Purpose and Outputs of data integration activities

A

Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping and transformation.

Data integration ultimately enables analytics tools
to produce effective, actionable business intelligence.

2
Q

Functional Requirements

A

Defines a system or its components
Describes the functions the software must perform
A function is nothing but inputs, its behaviour and outputs

Calculation / data manipulation / business process / user interaction

Helps you capture the intended behaviour of the system

Behaviour may be expressed as functions, services or tasks

3
Q

Non-functional requirement

A

Defines the quality attributes of a software system

They represent a set of standards used to judge the specific operation of a system

E.g. how fast does the website load?

Essential to ensure the usability and effectiveness of the entire software system

Allows you to impose constraints or restrictions on the design of the system across the various agile backlogs

4
Q

Speed of Data Integration

A

Faster for data warehouses than transactional relational databases because analytical access to the warehouse is read-only.

Data warehouses handle large analytical queries, which avoids the problem of running those queries against transactional (OLTP) databases.

Loading data into warehouses (ETL) is usually carried out in batches, and during loading the warehouse is unavailable.

Having data integrated into a single source saves time (it doesn't take as long to prepare and analyse the data).

5
Q

Data Integration: Structure and Rules

A
  • Security policies
  • Access layers
  • Should be immutable (not able to change the content of the integrated data destination)
  • Validation checks should be carried out during ETL
  • Validate the source and target table structure and data types
  • Validate the column names against a mapping doc
  • Verification done using ETL testing
  • Verify that the data is accurate
  • Verify the data is the right data required to be in the data warehouse
  • Verify that the data has no duplicates in the data warehouse (see the sketch after this list)
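A minimal sketch of a post-load duplicate check, assuming a hypothetical warehouse table `customer` with a `customer_id` business key:

```
-- Hypothetical example: count business keys that appear more than once
-- in the warehouse table (any rows returned indicate duplicates).
SELECT customer_id, COUNT(*) AS copies
FROM customer
GROUP BY customer_id
HAVING COUNT(*) > 1;
```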
6
Q

Rationale for using and integrating data from multiple sources

A
  • Consistency
  • Spotting missing data and identifying gaps that need to be filled, through ETL

7
Q

Benefits of integrating data from multiple sources

A
  • Increased collaborations across teams
  • Better business intelligence and insights
  • Data availability to all stakeholders
8
Q

To consider: Data in a business context

A
  1. Variability: illustrates how things differ, and by how much
  2. Uncertainty: Good visualisation practices frame uncertainty that arises from variation in data
  3. Context: Meaningful context helps us frame uncertainty against underlying variation in data
9
Q

Descriptive Analysis

A

Looks at past data and tells what happened. Often used when tracking KPIs, revenue and sales

10
Q

Diagnostic Analysis

A

Aims to determine why something happened

11
Q

Predictive Analysis

A

predicts what is likely to happen in the future

Trends are derived from past data and used to create predictions

12
Q

Prescriptive Analysis

A

Combines the information found from the previous 3 types of data analysis and forms a plan of action for the organisation to face the issue or decision

13
Q

Data Warehouse

A

An industry-standard tool that allows the collection and organisation of data originating from various sources is the data warehouse.

14
Q

Reasons for using data from multiple sources

A
  • allows the analyst to pull together data to resolve issues
  • perform data analysis
  • obtain a 360 view of a problem
  • assist with decision making among others
15
Q

Quality of Data Sources - things that could go wrong

A

Data Truncation - precision is lost / wrong data types
Data Corruption - commas in the wrong place
Data Missing - only a portion of the data set is uploaded
Data Typing - wrong data type uploaded into field
Data Translation - Wrong encoding/symbols

16
Q

To ensure better quality of data:

A
CheckSum
Spot Checks (Eyeballing)
Mins/Max/Aggregates
Counts
Export Comparison
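A hedged sketch of the counts and min/max/aggregate checks, assuming hypothetical `source_sales` and `warehouse_sales` tables; comparing the two result sets is a simple form of export comparison:

```
-- Row counts and basic aggregates on the source ...
SELECT COUNT(*)       AS row_count,
       MIN(sale_date) AS earliest,
       MAX(sale_date) AS latest,
       SUM(amount)    AS total_amount
FROM source_sales;

-- ... and the same figures on the target; the numbers should match.
SELECT COUNT(*)       AS row_count,
       MIN(sale_date) AS earliest,
       MAX(sale_date) AS latest,
       SUM(amount)    AS total_amount
FROM warehouse_sales;
```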
17
Q

Data Integration

A

the process of combining data from different sources to help data managers and executives analyse it and make smarter business decisions

18
Q

Data Integration Process involves:

A

A person or system locating, retrieving, cleaning, and presenting the data

19
Q

Manual Data Integration

A

Occurs when a data manager oversees all aspects of the integration, usually by writing custom code, without automation.
Best for one-time instances; untenable for complex or recurring integrations because it is a tedious, manual process.

20
Q

Manual Data Integration: Benefits

A

Reduced cost: requires little maintenance and typically only integrates a small number of data sources
Greater Freedom: user has total control over the integration

21
Q

Manual Data Integration: Limitations

A

Less access: A developer or manager must manually orchestrate each integration
Difficulty scaling: Scaling for larger projects requires manually changing the code of each integration, and that takes time
Greater room for error: A manager/analyst must handle the data at each stage

22
Q

Middleware data integration

A

Software that connects applications and transfers data between them and databases. Middleware can act as an interpreter between old and new systems
Mostly it is a communication tool and has limited capabilities for data analytics

23
Q

Middleware data integration: Benefits

A

Better data streaming: the software conducts the integration automatically
Easier access between systems: software is coded to facilitate communication between the systems in a network

24
Q

Middleware data integration: Limitations

A

Less access: middleware needs to be developed and maintained by a developer with technical knowledge
Limited functionality: can only work with certain systems

25
Application-based integration
Software applications do all the work: they locate, retrieve, clean and integrate data from disparate sources, making it easy for data to move from one source to another.
26
Application-based integration: Benefits
Simplified processes: one application does all the work automatically
Easier information exchange: the application allows systems and departments to transfer information seamlessly
Fewer resources are used: because much of the process is automated, analysts can pursue other projects
27
Application-based integration: Limitations
Limited access: requires technical knowledge and an analyst to oversee application maintenance/deployment
Inconsistent results: the approach is unstandardised and varies between businesses offering this as a service
Complicated set-up: designing the integration requires developers/analysts with technical knowledge
Difficult data management: accessing different systems can lead to compromised data integrity
28
Uniform access integration
Accesses data from even more disparate sets and presents it uniformly, while allowing the data to stay in its original location
29
Uniform access integration: Benefits
Lower storage requirements
Easier data access: works well with multiple data sources/systems
Simplified view of data: uniform appearance of data for the end user
30
Uniform access integration: Limitations
Data integrity challenges: lots of sources can lead to compromised data integrity
Strained systems: data host systems are not usually designed to handle the amount/frequency of data requests
31
Common storage integration: (data warehousing)
Similar to uniform access - involves creating and storing a copy of the data in a data warehouse. More versatility
32
Common storage integration (data warehousing): Benefits
Reduced burden: the host system isn't constantly handling data queries
Increased data version management control: one source = better data integrity
Cleaner data appearance
Enhanced data analytics: maintaining a stored copy allows more sophisticated queries
33
Common storage integration (data warehousing): Limitations
Increased storage costs: creating a copy = paying for storage
Higher maintenance costs
34
WHEN TO USE IT: | Manual Data Integration
Merge data for basic analysis between a small amount of data sources
35
WHEN TO USE IT: | Middleware Data Integration
Automate and translate communication between legacy and modernised systems
36
WHEN TO USE IT: | Application-based Data Integration
Automate and translate communication between systems and allow for more complicated data analysis
37
WHEN TO USE IT: | Uniform Access Data Integration
Automate and translate communication between systems and present the data uniformly to allow for complicated data analysis
38
WHEN TO USE IT: | Common storage Data Integration
Present the data uniformly, create and store a copy and perform the most sophisticated data analysis
39
Common user interface
****
40
Virtual Integration
Leaves the data in the source systems and defines a set of views to provide and access the unified view for the customer across the whole enterprise.
Pros: nearly zero latency of data updates
Cons: limited possibility of data history and version management
41
Physical Data Integration: ETL Extract:
The process of reading multiple data sources into the Data Warehouse. Data is COPIED (not moved).
Data validation occurs during this stage; you must have the correct:
- structure
- format
- permissions
Method of extraction: for each data source, define whether the extraction process is manual or tool based.
42
Physical Data Integration: ETL Transform:
The process of combining the data tables or linking them using relational databases. The data itself is not changed in any way. Data engineers must consider the efficiency of the databases as well as ensuring that all necessary data can be accessed.
Before moving the data into a single point of reference we need to:
- remove inconsistencies
- standardise various data elements
- make sure of the meanings of the data names in each file
- deduplicate
- derive new calculated values
- validate the data
(see the sketch after this list)
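A minimal sketch of a few of these transform steps in SQL, assuming a hypothetical `staging_orders` table; it standardises a text column, derives a calculated value, and deduplicates with DISTINCT:

```
-- Hypothetical transform: standardise country names, derive a total,
-- and remove exact duplicate rows before loading.
SELECT DISTINCT
       order_id,
       UPPER(TRIM(country))  AS country,       -- standardise a data element
       quantity * unit_price AS order_total    -- derive a new calculated value
FROM staging_orders
WHERE order_id IS NOT NULL;                    -- simple validation rule
```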
43
Physical Data Integration: ETL Load:
The process of writing the data to the target database.
Due to the nature of Big Data, it is necessary to use parallel processing to manage the volume of data being written to the system.
Data verification is undertaken post-loading to ensure the data is accurate.
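A minimal sketch of a batch load step, assuming hypothetical `staging_orders_clean` and `dw_orders` tables; the INSERT ... SELECT pattern writes the transformed rows into the target:

```
-- Hypothetical load: write the cleaned staging rows into the warehouse table.
INSERT INTO dw_orders (order_id, country, order_total)
SELECT order_id, country, order_total
FROM staging_orders_clean;

-- Post-load verification: the target should hold the same number of rows.
SELECT COUNT(*) FROM staging_orders_clean;
SELECT COUNT(*) FROM dw_orders;
```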
44
Sequence
It is the order in which we want the computer to execute the instructions we provide as programmers.
45
Selection
Selecting which path of an algorithm to execute depending on some criteria.
46
Iteration
Refers to looping or repeating procedures.
47
SIMPLE QUERY ``` SELECT FROM WHERE ORDER BY HAVING LIMIT DISTINCT ```
```
SELECT [DISTINCT] column, names   -- DISTINCT brings back unique values
FROM table-name
WHERE condition                   -- filter on rows
HAVING condition                  -- filters on grouped / aggregated rows
ORDER BY sort-order               -- sort order of the result
LIMIT n                           -- restricts the number of rows
```
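A concrete example of the pattern, assuming a hypothetical `orders` table with `customer_id`, `order_date` and `amount` columns:

```
-- Distinct customers with orders over 100, newest first, top 10 only.
SELECT DISTINCT customer_id, order_date
FROM orders
WHERE amount > 100
ORDER BY order_date DESC
LIMIT 10;
```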
48
Union Rules
Union = combining the results of multiple SELECT statements in a single query
1. Must match the number of columns, with compatible data types
2. Can only have one ORDER BY, at the bottom of the full select statement
3. UNION removes exact duplicates, whereas UNION ALL allows duplicates
4. Conditions between UNION SELECT statements should match
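A short illustration of these rules, assuming hypothetical `online_sales` and `store_sales` tables with matching column counts and compatible types:

```
-- UNION removes exact duplicates; UNION ALL would keep them.
SELECT customer_id, amount FROM online_sales
WHERE amount > 0
UNION
SELECT customer_id, amount FROM store_sales
WHERE amount > 0
ORDER BY amount DESC;   -- single ORDER BY at the bottom of the full statement
```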
49
SQL Expressions: CASE
groups data into categories or classifications
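A small example of CASE used to group values into categories, assuming a hypothetical `orders` table:

```
-- Classify each order by size.
SELECT order_id,
       amount,
       CASE
           WHEN amount >= 1000 THEN 'large'
           WHEN amount >= 100  THEN 'medium'
           ELSE 'small'
       END AS order_size
FROM orders;
```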
50
SQL Expressions: DATETIME
type of variable that allows storing a date/time value in the database YYYY-MM-DD HH:MI:SS
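A brief example of storing and filtering on a DATETIME column, assuming a hypothetical `events` table and MySQL-style syntax:

```
CREATE TABLE events (
    event_id   INT,
    created_at DATETIME          -- stored as YYYY-MM-DD HH:MI:SS
);

-- Filter on a date/time range.
SELECT event_id, created_at
FROM events
WHERE created_at >= '2023-01-01 00:00:00'
  AND created_at <  '2023-02-01 00:00:00';
```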
51
SQL Expressions: Compound
mathematical and other operations within the relational database: arithmetic, comparison, logical, string
52
Arithmetic
+ - * / ** (addition, subtraction, multiplication, division, to the power of)
53
Comparison
=  Equal to
!= or <>  Not equal to
<  Less than
<= or !>  Less than or equal to (not greater than)
>  Greater than
>= or !<  Greater than or equal to (not less than)
54
Logical
ALL / AND = does the value meet ALL criteria
ANY / OR = does the value meet ANY criteria
BETWEEN = is the value BETWEEN the listed values
EXISTS = does a row meeting the criteria exist
IN = is the value found in the listed literal values
LIKE = compares the value to listed values using wildcards
NOT = reverses the meaning of a logical operator
IS NULL = checks if the value is null
UNIQUE = searches for duplicates
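A few of these operators combined in one WHERE clause, assuming a hypothetical `customers` table:

```
SELECT customer_id, name, country
FROM customers
WHERE country IN ('UK', 'France', 'Spain')   -- IN: value found in listed literals
  AND signup_year BETWEEN 2020 AND 2023      -- BETWEEN: value within a range
  AND name LIKE 'A%'                         -- LIKE: wildcard comparison
  AND phone IS NOT NULL;                     -- NOT reverses IS NULL
```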
55
String
CHAR(n) = returns the character at index n
CONCAT() = concatenates items (puts together)
FORMAT(n) = returns a number formatted to n decimal places
LOWER() = returns the argument in lowercase
UPPER() = returns the argument in uppercase
REPEAT() = repeats the string a certain number of times
TRIM() = removes leading and trailing spaces
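A quick illustration using a few of these functions (MySQL-style syntax), assuming a hypothetical `customers` table:

```
SELECT UPPER(TRIM(name))           AS name_clean,    -- trim spaces, then uppercase
       CONCAT(city, ', ', country) AS location,      -- put two columns together
       FORMAT(balance, 2)          AS balance_2dp    -- format to 2 decimal places
FROM customers;
```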
56
Functions ``` AVG COUNT MAX MIN GROUP BY ROUND CAST CONVERT ISNULL ```
```
AVG = average value
COUNT = number of rows
MAX = largest number
MIN = smallest number
GROUP BY = indicates dimensions to group data by on aggregates
ROUND = specifies the number of decimal places
CAST = changes the data type of an expression (temp)
CONVERT = converts the data type of a value (perm)
ISNULL = returns a specified value if the expression is null; if not null, returns the expression
```
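A worked aggregate query combining several of these, assuming a hypothetical `sales` table with `region` and `amount` columns:

```
-- Average, min, max and row count per region, rounded to 2 decimal places.
SELECT region,
       COUNT(*)              AS num_sales,
       ROUND(AVG(amount), 2) AS avg_amount,
       MIN(amount)           AS smallest,
       MAX(amount)           AS largest
FROM sales
GROUP BY region;
```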
57
JOINS ``` INNER FULL (OUTER) RIGHT LEFT UNION SELECT INTO SUBQUERIES EXCEPT CROSS ```
INNER = joins only what matches in both tables
FULL (OUTER) = combination of a left join and a right join
RIGHT = includes everything from the right table and anything that matches in the left
LEFT = includes everything from the left table and anything that matches in the right
UNION = stacks the rows of two SELECT queries
SELECT INTO = copies data from one table into a new table
SUBQUERIES = queries within another SQL query, embedded within the WHERE clause as a condition to further restrict results
EXCEPT (and outer joins) = help to handle missing data between tables
CROSS JOIN = Cartesian product; not usually desired, slow performance
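A small join example, assuming hypothetical `customers` and `orders` tables linked by `customer_id`:

```
-- LEFT JOIN: every customer, plus order details where they exist.
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers AS c
LEFT JOIN orders AS o
       ON o.customer_id = c.customer_id;

-- Subquery in WHERE: only customers who have placed an order over 1000.
SELECT customer_id, name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 1000);
```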
58
Implicit Data Conversion
The server automatically converts the data from one type to another during the query process
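A small illustration, assuming a hypothetical `orders` table where `order_id` is an integer; many servers will implicitly convert the string literal before comparing:

```
-- order_id is numeric, '42' is a string: the server converts one side
-- automatically so the comparison can be evaluated.
SELECT *
FROM orders
WHERE order_id = '42';
```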
59
Data Profiling
A review of the data to determine its accuracy, completeness and validity
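A minimal profiling sketch, assuming a hypothetical `customers` table; row counts, null counts and distinct counts give a first view of accuracy, completeness and validity:

```
SELECT COUNT(*)                    AS total_rows,
       COUNT(email)                AS non_null_emails,   -- completeness
       COUNT(DISTINCT customer_id) AS distinct_ids,      -- uniqueness / validity
       SUM(CASE WHEN email NOT LIKE '%@%' THEN 1 ELSE 0 END) AS suspect_emails
FROM customers;
```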