12. Manipulating Data Flashcards

(167 cards)

1
Q

Define Data integrity

A

refers to how reliable a set of data is.

2
Q

Define data governance

A

systems and procedures for ensuring data is reliable and secure

2
Q

Data integrity has a number of key components.

What are they?

A
  1. Accuracy
  2. Consistency
  3. Completeness
  4. Reliability
  5. Security
2
Q

Explain Accuracy in Data integrity

A

data should be correct and free from errors or inconsistencies.

For example, a customer’s email
address should be correct and functional.

2
Q

Explain Consistency in Data integrity

A

data should follow set rules and formats.

For example, phone numbers should all follow the same format.

This consistency ensures that the data doesn’t contradict itself or provide conflicting or misleading information.

3
Q

Explain Completeness in Data integrity

A

all necessary data should be included with no missing data points.

For example, if a data set intended to analyse company sales performance did not contain the number of units sold for some items, it would be incomplete and could lead to incorrect decisions.

4
Q

Explain Reliability in Data integrity

A

users should be able to trust the data.

It should be available when needed, and

its reliability should be ensured through measures such as backups, security protocols and validation processes.

5
Q

Explain Security in Data integrity

A

data integrity also includes the security measures used to prevent unauthorised
access, modification, or deletion of data in order to safeguard data from tampering or
corruption.

6
Q

What is validation?

A

techniques and processes designed to ensure data is reasonable and follows specific rules

7
Q

What are some key tools and practices for maintaining data integrity?

A
  1. Data validation
  2. Data backups
  3. Data access controls
  4. Audit trails.
8
Q

What is data validation?

A

The process of checking data for possible errors and inconsistencies before entering it into a system.

9
Q

What does data backup involve?

A

It involves regularly creating copies of the data to protect it from loss or corruption, ensuring data can be restored in the case of system error or failure.

10
Q

Why is data backup important?

A

it protects data from loss or corruption by allowing restoration in case of system error or failure.

11
Q

What do data access controls do?

A

They limit who can access and modify data to prevent unauthorised changes.

12
Q

Name three methods for controlling access to data.

A

  1. Preventing physical access
  2. Using security settings
  3. Encryption.
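As a minimal sketch of the encryption method, assuming the third-party cryptography package (not named in the deck) is installed:

```python
# Minimal sketch of encrypting data at rest, assuming the third-party
# "cryptography" package (pip install cryptography) is available.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, store this key securely
cipher = Fernet(key)

token = cipher.encrypt(b"customer@example.com")  # unreadable without the key
print(cipher.decrypt(token))                     # b'customer@example.com'
```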

13
Q

What is the purpose of an audit trail?

A

To track changes made to data and by whom, providing accountability and transparency.

14
Q

What are common elements included in an audit trail?

A

Timestamp, username, action performed, affected data element, and source of change.
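As an illustration of how these elements might be recorded (the field names below are assumptions, not a standard):

```python
# Illustrative sketch: appending one audit-trail entry as a line of JSON.
# Field names are assumptions chosen to mirror the elements listed above.
import json
from datetime import datetime, timezone

entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # when it happened
    "username": "jsmith",                                 # who made the change
    "action": "UPDATE",                                   # action performed
    "data_element": "customer.email",                     # affected data element
    "source": "billing_app",                              # source of change
}

with open("audit_trail.log", "a") as log:
    log.write(json.dumps(entry) + "\n")
```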

15
Q

What does the “source of change” in an audit trail refer to?

A

Information about the system or application where the action originated.

16
Q

DATA DICTIONARIES

What does the principle ‘secure by design’ mean in IT?

A

It means planning systems and data with security and integrity in mind from the start.

17
Q

Why is it important to consider data integrity at the planning stage?

A

It helps ensure that data integrity can be more easily established and maintained.

18
Q

What is a data dictionary?

A

A blueprint or plan for a database that details how data is structured and stored.

19
Q

What kind of information does a data dictionary provide?

A

Metadata about data elements, including structure, definitions, relationships, and attributes.

20
Q

What is metadata?

A

details or information that describe other data to provide deeper meaning or understanding

21
Q

How does a data dictionary help maintain data integrity?

A

By ensuring all data complies with required structure and definitions within a system.

22
What are common components found in a data dictionary?
Table name, field name, description, data type, length/size, validation rules, constraints.
23
What does a data dictionary help with in database management?
It helps users understand the structure and properties of data, enabling effective management, querying, consistency, and integrity.
24
What is a field name in a data dictionary?
The identifier for a specific category of data within a table (e.g., 'email address').
25
What is a table name in a data dictionary?
The unique name of a specific table within a database.
26
Why is the 'description' field important in a data dictionary?
It provides clarity about what the field represents, especially when abbreviations or variations are used.
27
What does 'data type' indicate in a data dictionary?
The way data is stored, which defines what operations can be performed (e.g., integer vs. string).
28
What are constraints in a data dictionary?
Additional rules like primary key, not null, or data relationships that affect how data is handled.
29
SOME COMMON DATA TYPES USED IN SQL DATABASES
LOOK AT PG 5 OF TEXTBOOK
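The textbook page is not reproduced here. As a rough sketch only, a few widely used SQL type names and constraints can be tried through Python's built-in sqlite3 module (exact type names vary between database systems):

```python
# Sketch of common SQL data types and constraints via the built-in sqlite3
# module. Type names vary between database systems; this is illustrative,
# not the textbook's list.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- whole numbers; primary key constraint
        name        VARCHAR(50) NOT NULL,  -- text up to 50 characters; must be present
        balance     DECIMAL(10, 2),        -- fixed-precision numbers, e.g. currency
        joined      DATE,                  -- calendar dates
        active      BOOLEAN DEFAULT 1      -- true/false values
    )
""")
```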
30
DATA VALIDATION How does reducing errors in stored data affect data integrity?
It helps improve data integrity by ensuring the data is more accurate and reliable.
31
What is the goal of data validation?
To reduce the chance of errors by ensuring data follows pre-set rules.
32
What are the three typical phases of data validation?
Input, check, and post-check action.
32
What happens in the 'input' phase of data validation?
Data is entered into the system, either by a user or through automated means.
33
What happens in the 'check' phase of data validation?
The system compares the entered data against predefined rules.
34
What is a typical 'post-check action' if data validation fails?
The system displays an error message and prompts the user to re-enter the data.
35
What is a typical 'post-check action' if data validation succeeds?
The system proceeds to the next item and may confirm the data is accepted (e.g., with a green tick).
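A minimal sketch of the three phases for a single field, reusing the parking-hours range rule that appears later in this deck:

```python
# Sketch of the input -> check -> post-check action cycle for one field.
def get_validated_hours() -> int:
    while True:
        raw = input("Parking time in hours (1-6): ")       # input phase
        if raw.isdigit() and 1 <= int(raw) <= 6:           # check phase
            print("Accepted.")                             # post-check: success
            return int(raw)
        print("Error: enter a whole number from 1 to 6.")  # post-check: re-enter
```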
35
Define data migration
extracting data from an existing dataset and transferring it to another
36
What are the seven main types of data validation rules?
1. Presence check 2. Range check 3. Lookup 4. List 5. Length 6. Format check 7. Check digit
37
What is a presence check in data validation?
It ensures that a required field is not left blank, e.g. ensuring the customer's name and delivery address are entered in an online purchase form.
38
What is a range check in data validation?
It ensures that input falls within a defined range of values, e.g. checking that the parking time entered is between 1 and 6 hours.
39
What is a lookup check in data validation?
It checks entered data against an external reference table or list, e.g. verifying a product code matches an existing code in the stock table.
40
What is a list check in data validation?
It ensures the data entered matches one of a small set of acceptable options.
41
What does a length check validate?
That the number of characters in the input meets specified minimum and/or maximum limits, e.g. ensuring a password is at least 8 characters long.
42
What is a check digit in data validation?
A number added to the end of a numeric string, used to detect input or transmission errors.
43
How does a check digit help in validation?
The system recalculates it using the same algorithm to verify data accuracy.
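A sketch of several of these rules in code. The check-digit function uses the Luhn algorithm (the scheme used on payment card numbers) purely as one well-known example; other schemes work on the same recalculate-and-compare principle:

```python
# Sketches of common validation rules; patterns and limits are examples only.
import re

def presence_check(value: str) -> bool:
    return value.strip() != ""                   # required field not blank

def range_check(hours: int) -> bool:
    return 1 <= hours <= 6                       # value within allowed range

def list_check(size: str) -> bool:
    return size in {"S", "M", "L"}               # one of a small set of options

def length_check(password: str) -> bool:
    return len(password) >= 8                    # minimum number of characters

def format_check(postcode: str) -> bool:
    return re.fullmatch(r"[A-Z]{2}\d{2}", postcode) is not None  # fixed pattern

def check_digit_ok(number: str) -> bool:
    """Luhn algorithm: recalculate and compare (one example scheme)."""
    digits = [int(d) for d in number]
    for i in range(len(digits) - 2, -1, -2):     # double every second digit
        digits[i] = digits[i] * 2 - 9 if digits[i] * 2 > 9 else digits[i] * 2
    return sum(digits) % 10 == 0                 # valid if total ends in 0
```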
44
RELATIONAL DATABASES What is a relational database?
A database that stores data in structured tables and creates relationships between data in separate tables.
45
Why are relational databases useful?
They help organize large amounts of related data efficiently and reduce redundancy.
46
How does a relational database help an e-commerce site?
It allows separate tables for customers, products, suppliers, and sales, and links data as needed (e.g., during a purchase).
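A minimal sketch of that idea using Python's built-in sqlite3 module: two tables linked by a foreign key, joined when a purchase is looked up (table and column names are illustrative):

```python
# Sketch: related tables linked by a foreign key, joined at query time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),  -- relationship
        product     TEXT
    );
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO sale VALUES (100, 1, 'Keyboard');
""")

for row in conn.execute("""
        SELECT customer.name, sale.product
        FROM sale JOIN customer ON sale.customer_id = customer.customer_id"""):
    print(row)  # ('Ada', 'Keyboard')
```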
47
What is data redundancy?
the unnecessary replication of data in a database
48
What are three major problems caused by data redundancy?
Increased storage costs, data inconsistency, and maintenance implications.
49
How does data redundancy increase storage costs?
Repeatedly storing the same data uses more space, which can be costly with large datasets.
50
How does data redundancy lead to data inconsistency?
If updates to the same data are made in one location but not others, mismatches can occur, e.g. a customer's address is updated in one department's database but not in others, causing confusion.
51
DATA NORMALISATION What is data normalisation?
A process that reduces redundancy by breaking complex tables into smaller, simpler ones and establishing relationships.
52
What is the main goal of data normalisation?
to eliminate repeated data and improve data integrity by organizing it efficiently.
53
What are the three main stages of normalisation called?
First normal form (1NF), second normal form (2NF), third normal form (3NF).
54
FIRST NORMAL FORM What is the aim of 1NF?
To remove any repeating groups of data in individual tables and ensure that each column represents a single attribute.
55
When creating a table in first normal form, your table should meet these requirements:
1. Each cell should contain single, indivisible values. 2. All entries in a column must be of the same data type. 3. Each column should have a unique name. 4. A primary key should be used to identify each record.
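As an illustration with invented data, a row holding a repeating group of phone numbers breaks 1NF; splitting it so each cell holds a single value satisfies it:

```python
# Illustration (invented data): removing a repeating group to reach 1NF.

# Not in 1NF: the "phones" cell holds more than one value.
before = [
    {"customer_id": 1, "name": "Ada", "phones": "0111 222, 0333 444"},
]

# In 1NF: each cell holds a single, indivisible value; customer_id plus
# phone together identify each record.
after = [
    {"customer_id": 1, "phone": "0111 222"},
    {"customer_id": 1, "phone": "0333 444"},
]
```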
56
SECOND NORMAL FORM What is second normal form (2NF)?
A stage of normalisation where data that applies to multiple records is placed into separate tables.
57
When creating a table in second normal form your table should meet these requirements:
First normal form must have been satisfied. There are no partial dependencies.
58
THIRD NORMAL FORM What is the aim of 3NF?
In third normal form, any data in a row that doesn't depend on the key should be eliminated and placed in its own table.
59
To be in third normal form the data must meet these requirements:
The requirements for second normal form must have been satisfied. Transitive dependencies have been removed.
60
LOGICAL DATA MODELS What is a logical data model?
A plan for how data elements in a database are structured and how they relate to each other.
61
What is the purpose of a logical data model?
To provide a high-level view of how data is organized, focusing on entities, attributes, and relationships.
62
How is a logical data model used in database planning?
It helps plan the database structure before implementation, often using an ERD and data dictionary.
63
What tools are commonly used alongside a logical data model in database planning?
Entity-relationship diagrams (ERDs) and data dictionaries.
64
What are the key components of a logical data model?
Entities, attributes, relationships, unique identifiers, and constraints.
65
Describe Entities
These represent the real-world objects or concepts that are important to the business. Each entity is a single table in a relational database. When defining each entity, it is important to ensure that the structure of the data is normalised as far as possible.
66
Describe Attributes
These are the characteristics or properties that describe entities (e.g. customer name, product price, order date). These are the fields in your database. Remember that when building your data model, each attribute (field) in each entity (table) should be atomic.
67
Describe Relationships
These define how entities are connected to each other and how the data and tables will interact when the database is deployed (e.g. a customer places an order, an order contains products). These are the one-to-one, one-to-many, and many-to-one relationships that you are familiar with.
68
Describe Unique identifiers
Each entity should have a unique identifier that allows each record to be easily identified (e.g. customer ID, product ID, order ID). Avoid ambiguity when selecting identifiers.
69
Describe Constraints
Rules that enforce data integrity, such as primary keys, foreign keys, and referential integrity.
70
BIG DATA What is Big Data?
Very large and complex data sets that are difficult to manage using traditional data-processing tools.
71
What makes Big Data valuable to organisations?
It can provide vital insights into various aspects of operations.
72
Why do many organisations prioritise using Big Data?
Because it helps them gain insights that can improve decision-making and efficiency.
73
THE FIVE Vs What are the 5 Vs?
Five key concepts when dealing with Big Data.
74
The 5Vs
volume, velocity, variety, veracity and value.
75
Describe volume
Volume refers to the amount of data. Sources such as social media and business transactions all generate data. The volume of data has massive implications for organisations in terms of storage and processing requirements.
76
Describe velocity
Velocity is the speed at which data is generated.
77
What does "Variety" in Big Data refer to?
The different types and formats of data from various sources.
78
What is structured data?
Well-organised data, typically stored in relational databases, which is easy to query and analyse.
80
What is unstructured data?
Data that lacks a predefined format, such as text files, images, videos, or social media content.
80
What is semi-structured data?
Data that is not fully structured but includes some organisational properties like metadata (e.g. JSON, XML).
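A small sketch of semi-structured data: a JSON record carries its own field names (metadata) but no fixed schema, and Python's built-in json module can read it:

```python
# Sketch: JSON is semi-structured - field names travel with the data,
# but records need not share a fixed schema.
import json

record = json.loads('{"user": "ada", "tags": ["big data", "iot"], "likes": 42}')
print(record["user"], len(record["tags"]))  # fields addressed by name
```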
81
What does "Veracity" in Big Data refer to?
The quality, accuracy, and trustworthiness of the data.
82
Give an example of a low-veracity data source.
Social media posts, due to user bias and lack of regulation.
83
How can organisations improve data veracity?
By verifying data sources, cleaning data, and applying quality checks during processing.
84
What does "Value" refer to in the context of Big Data?
The meaningful insights and benefits an organisation gains from analysing Big Data.
85
Give an example of how Big Data adds value to a business.
By identifying consumer trends that allow for more targeted marketing and product development.
86
INFRASTRUCTURE AND SERVICES FOR BIG DATA Define infrastructure:
hardware and software resources that support data and processes
87
What are the three main groups of infrastructure?
Collection, storage, and transmission.
88
COLLECTION What does the "Collection" aspect of Big Data refer to?
The process of gathering data from a wide range of sources and devices.
89
Name common sources of Big Data.
Social media, IoT devices, business transactions, web and app logs, surveillance systems, machine-generated data, and publicly available data.
90
What is IoT (the Internet of Things)?
The connection of objects through the internet to allow the sending and receiving of data.
91
STORAGE Why do many companies prefer cloud services over 'in-house' storage?
Because cloud services offer scalable, pay-as-you-go storage and system resources.
92
Name three popular cloud services used for Big Data storage and processing.
Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
93
What additional feature do many cloud providers offer with their storage services?
Tools to manipulate and analyse Big Data sets.
94
Common storage systems include:
Distributed file systems, NoSQL databases, and data warehouses.
95
What are distributed file systems used for in Big Data storage?
Storing large volumes of data across multiple nodes for scalability and fault tolerance.
96
What is the role of NoSQL databases in Big Data?
Handling unstructured or semi-structured data with flexibility and scalability.
97
What is a data warehouse in the context of Big Data?
A central storage point for structured and semi-structured data optimized for analytics.
98
What is a key benefit of storage systems that sit on top of cloud infrastructure?
They can often be used across different cloud services, offering flexibility and portability.
99
TRANSMISSION What does "transmission" refer to in the context of Big Data?
The moving or transfer of data in and between systems.
99
What is a key issue in Big Data transmission?
Transferring large data sets while maintaining data quality and accessibility.
99
What infrastructure solutions can be used to support data transmission?
1. High-bandwidth networks: fibre optic cables and dedicated network connections offer high bandwidth to accommodate large data volumes. 2. Distributed storage: distributing data across multiple storage nodes reduces network strain and improves accessibility. 3. Data compression: compressing data before transmission reduces file size and bandwidth requirements. 4. Cloud-based solutions: storing and accessing data on a cloud storage platform reduces the amount of data that needs to be downloaded to the user's computer.
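For the data-compression point, a sketch using Python's built-in gzip module with invented sensor data, showing the size reduction before transmission:

```python
# Sketch: compressing data before transmission with the built-in gzip module.
import gzip

payload = ("timestamp,sensor,value\n" + "2024-01-01,temp,21.5\n" * 10_000).encode()
compressed = gzip.compress(payload)

print(len(payload), "bytes before,", len(compressed), "bytes after")
# Repetitive data like this compresses heavily, cutting bandwidth needs.
```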
99
What is a silo?
a collection of data used and controlled by one group, department or business that is not accessible by the rest of the organisation
100
IMPACT OF STORING BIG DATA What are the impacts of storing Big Data?
Access, processing time, transmission time, and security.
101
Access has 3 sections. What are they?
Data silos, on-demand access, and cross-border data access.
102
Why are data silos a challenge for Big Data?
Because integrating data from different silos is complex and needs effective data integration strategies.
103
What are data silos?
Isolated data storage areas across different departments or systems.
103
What is on-demand access in the context of Big Data?
The ability for users to access data whenever and wherever they need it.
104
What does data access depend on within a business?
The point in the business and analysis process where the action occurs.
105
How do organisations ensure on-demand data access?
By implementing systems with reliable access, often using cloud services with limited downtime and backup.
106
What is a major issue with cross-border data access?
It may face legal and regulatory challenges due to international data-sharing laws.
107
What must organisations do to ensure compliance in cross-border data access?
Navigate complex regulatory environments.
108
Processing time has 2 sections. What are they?
Real-time access and processing speed.
109
Why is timely data processing important?
To ensure users can use data effectively and maintain productivity.
110
What is real-time access in Big Data?
The ability to access and use data instantly as it's generated or updated.
111
Why is achieving real-time access to Big Data challenging?
Because it requires low-latency systems and often significant technology investment.
112
What technologies may support real-time data processing?
High-powered local computers and servers, and cloud services from third-party vendors.
113
What can slow down processing large data sets even with high-speed hardware?
The volume of calculations involved, which can exceed processing limits.
114
How can systems reduce the perceived processing time for users needing real-time data?
By predicting user needs and performing much of the data processing ahead of time.
115
What is a key challenge in data transmission for Big Data systems?
Ensuring timely transmission of data between systems.
116
Why is low latency important in Big Data transmission?
Because it minimizes delays experienced by the user, especially in real-time or time-critical applications.
117
Security has 2 sections. What are they?
Data privacy and data security.
118
What does data privacy protect in Big Data?
Sensitive information and the privacy rights of individuals.
119
What is essential to safeguard data security?
Implementing strong access controls, encryption, and other security measures.
120
What are three concepts that have important roles in managing, processing and deriving insights from Big Data?
Data mining, data warehousing, and data analytics.
121
DATA MINING What is data mining?
The process of extracting meaningful information from data.
122
What statistical techniques are commonly used in data mining?
Clustering, classification, regression, and anomaly detection.
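As a sketch of one of these techniques, k-means clustering over invented two-dimensional points, assuming the third-party scikit-learn package:

```python
# Sketch: clustering invented 2-D points into two groups with k-means,
# assuming the third-party scikit-learn package (pip install scikit-learn).
from sklearn.cluster import KMeans

points = [[1, 2], [1, 3], [2, 2],    # one natural group...
          [9, 9], [10, 8], [9, 10]]  # ...and another
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # e.g. [1 1 1 0 0 0]: each point assigned to a cluster
```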
123
What technologies are often used to assist in data mining?
Machine learning and artificial intelligence.
124
DATA WAREHOUSING What does data warehousing involve?
The collection, storage, and management of raw data from different sources.
125
What is a data warehouse?
A central repository where the raw data collected from a range of different sources is stored and managed.
125
What is combined in a data warehouse?
Data from an organisation's own data silos and external data.
126
What methods are used in data analytics to gain insights from data?
Data analysis and statistical methods.
126
DATA ANALYTICS What is data analytics?
Data analytics takes the information derived from data mining and turns it into actionable intelligence for an organisation.
127
What are some common functions of data analytics?
1. producing reports, dashboards, charts and graphs 2. natural language search, text analysis 3. machine learning 4. statistical analysis 5. modeling and simulations 6. data import and export to other software 7. data security.
128
What are the three main strands of data analytics?
Descriptive analytics, predictive analytics, and prescriptive analytics.
129
What is descriptive analytics?
The process of summarising historical data to gain insights into what has happened.
130
How does diagnostic analytics relate to descriptive analytics?
It helps understand why certain events occurred, aiding in improving good practices or avoiding errors.
131
What is predictive analytics?
The use of historical data to predict future outcomes and trends by identifying causes and effects.
132
Give an example of predictive analytics.
Predicting a shortage of a particular food type based on weather patterns and the resulting price increase.
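A toy sketch of that idea with the standard library's statistics.linear_regression (Python 3.10+); the sales figures are invented:

```python
# Toy predictive-analytics sketch: fit a trend to past data and extrapolate.
from statistics import linear_regression

months = [1, 2, 3, 4, 5]
sales = [100, 110, 125, 135, 150]  # invented historical figures

slope, intercept = linear_regression(months, sales)
print("Predicted month 6 sales:", slope * 6 + intercept)
```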
133
What is prescriptive analytics?
The process of suggesting actions required to optimise results once a trend or outcome is predicted.
134
How does prescriptive analytics assist organisations?
It suggests actions to mitigate negative outcomes, such as buying more supplies or finding alternative suppliers.
135
THE USE OF BIG DATA BY INDIVIDUALS, ORGANISATIONS AND SOCIETY What are the main areas of use?
Healthcare: 1. Personalised treatment plans 2. Accelerated drug discovery and development 3. Tracking and managing public health threats.
Infrastructure planning: 1. Predictive maintenance 2. Traffic management and urban planning 3. Utility infrastructure optimisation.
Transportation: 1. Public transportation optimisation 2. Fuel efficiency and emission reduction 3. Safety management.
Fraud detection: 1. Pattern recognition 2. Behaviour analysis 3. Real-time monitoring.
136
HEALTHCARE How has Big Data transformed healthcare in recent years?
It allows for the analysis of large data sets to discover patterns and anomalies, significantly impacting doctors, patients, treatments, and technologies.
137
What is the role of Big Data in personalised treatment plans?
By analyzing patient data and comparing it with data from other patients, medical professionals can predict disease risks and tailor treatments to improve patient outcomes.
138
How does Big Data accelerate drug discovery and development?
By analyzing genetic and clinical data, doctors can identify trends in effective treatments, allowing for targeted or combined approaches to create new treatments.
139
How is Big Data used to track and manage public health threats?
Healthcare professionals and government agencies use Big Data to monitor disease outbreaks and risk factors, allowing for more accurate analyses and informed public health interventions.
140
INFRASTRUCTURE PLANNING How does Big Data improve town and city infrastructure?
By enhancing the quality and efficiency of services such as roads, public transport, and utilities like electricity.
141
How is Big Data used for predictive maintenance in infrastructure?
It analyzes data from sensors, monitoring equipment, and historical data to predict when maintenance or replacement of systems is needed, allowing for proactive action to prevent failures.
142
What is the benefit of predictive maintenance for infrastructure?
It reduces downtime and extends the lifespan of critical infrastructure components.
143
How is Big Data used in traffic management and urban planning?
By analyzing traffic patterns, collecting real-time data from sensors and GPS devices, and optimizing transportation networks to inform decisions on road design, public transportation routes, and traffic signal timings.
144
How does Big Data help in utility infrastructure optimisation?
By analyzing data from sensors and systems to optimize the efficiency and quality of utilities, such as predicting fluctuations in energy and water demand, and planning for distribution, storage, and maintenance.
145
What factors are analyzed to optimise utility infrastructure using Big Data?
Usage patterns, consumption levels, system performance, and weather forecasts.
146
TRANSPORTATION How is Big Data used to improve transportation?
By optimizing road planning and public transportation, and improving fuel efficiency, emissions, and safety management.
147
What benefits come from optimizing public transportation with Big Data?
Improved service reliability, better alignment with passenger needs, and increased public transport usage.
148
How does Big Data contribute to safety management in transportation?
By analyzing data related to accidents, near misses, and safety incidents to identify risk factors and implement preventative measures.
148
How is Big Data used to improve fuel efficiency and reduce emissions?
By monitoring fuel consumption, vehicle performance, and emissions to help fleet operators optimize fuel use, reduce carbon emissions, and comply with environmental regulations.
149
FRAUD DETECTION How does Big Data help in detecting fraud?
By enabling organizations to analyze vast and diverse datasets to identify patterns, anomalies, and suspicious activities that may indicate fraudulent behavior.
149
What is done to ensure safety measures are effective in transportation?
The safety measures are continuously monitored and refined based on data analysis.
149
How does pattern recognition contribute to fraud detection?
By identifying patterns of normal behavior and detecting deviations that may signal fraudulent activities. Machine learning algorithms are used to recognize these patterns.
150
How does behavior analysis help detect fraud?
By analyzing user behavior and transaction patterns to identify deviations from normal behavior, such as unusually expensive purchases or transactions in different countries, which may indicate fraud.
151
How do banks use Big Data to track fraudulent activities?
Banks track transaction patterns and compare purchases against normal behavior. If a transaction deviates from typical patterns, the bank may contact the account holder to verify the activity.
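A simplified sketch of that comparison using a z-score over invented transaction amounts (real systems use far richer features and machine-learning models):

```python
# Simplified sketch: flag a transaction far from the account's normal
# spending using a z-score. Amounts are invented.
from statistics import mean, stdev

history = [12.50, 30.00, 18.99, 25.40, 22.10]  # typical purchases
new_amount = 950.00

z = (new_amount - mean(history)) / stdev(history)
if abs(z) > 3:  # more than three standard deviations from the mean
    print("Flag for verification: unusual transaction")
```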
152
What is the benefit of real-time monitoring in fraud detection?
It enables organizations to identify and respond to potential fraudulent activities quickly, minimizing financial losses for both the organization and its customers.