Data Processing Lifecycle Flashcards
What are the five stages of the data processing lifecycle?
- Data ingestion and integration
- Data processing
- Data storage
- Data analysis
- Reporting
What does the first stage of the data processing lifecycle involve?
- Collecting data from a variety of sources, transforming it (if needed) to match the target storage, and then storing it in the target storage
- When data is integrated from multiple sources, we must be aware of data heterogeneity
What types of data make up heterogeneous data?
- XML files
- JSON files
- Weblogs and cookies
- SQL queries
- Flat files
Why is heterogeneous data increasing?
The rise in technology is increasing the amount of data produced
What is data heterogeneity?
- Data made up of different types, sources, structures or formats.
- May be in a structured, unstructured or semi-structured format
What is structured data?
- Conforms to a well-defined schema - schema on write
- Often tabular (rows are datapoints and columns are attributes)
Where is structured data stored?
- Relational Databases
- Data Warehouses
- Legacy Data Systems
What is semi-structured data?
- Schema is not completely defined by a data model – no schema on write
- May be in the format of: HTML files, XML files or JSON files
- The size, order and contents of the elements can be different
- Often used with IoT devices
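A minimal sketch of the "no schema on write" idea, using made-up JSON records from hypothetical IoT sensors (the device names and fields are assumptions for illustration):

```python
import json

# Semi-structured JSON records: they share a format, but the size,
# order and contents of the elements differ between records.
readings = [
    '{"device": "t-1", "temperature": 21.5}',
    '{"device": "h-7", "humidity": 40, "battery": 87}',
]

for raw in readings:
    record = json.loads(raw)
    # Each record can carry a different set of fields.
    print(sorted(record.keys()))
```

Both records parse without any schema being declared up front; the consumer discovers the structure when it reads the data.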
What formats might semi-structured data be in and where might they be used?
- HTML files
- XML files
- JSON files
- Often used in IoT devices
What is unstructured data?
- No formal description of schema - no schema on write
- Human-readable; requires pre-processing for a computer to extract information
What types of file formats make up unstructured data?
- Text files
- Image files
- Video files
- Audio files
What are data ingestion and integration frameworks?
- Often carried out as a single technical solution
- Have been used in data warehouses for a long time
- Uses the ETL process
What is the ETL process?
- Extract: collects raw data, often in a structured format
- Transform: processes the data into a format matching the end destination
- Load: stores the transformed data into its new storage location
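The three ETL stages can be sketched in a few lines of Python; the example data, field names, and output file are assumptions, not part of the notes:

```python
import json

def extract():
    """Extract: collect raw records (here from an in-memory source)."""
    return [{"name": " alice ", "age": "30"}, {"name": "BOB", "age": "25"}]

def transform(rows):
    """Transform: clean values and convert types to match the destination."""
    return [{"name": r["name"].strip().title(), "age": int(r["age"])} for r in rows]

def load(rows, path):
    """Load: store the transformed records in their new storage location."""
    with open(path, "w") as f:
        json.dump(rows, f)

# The pipeline runs the stages in order: extract -> transform -> load.
load(transform(extract()), "people.json")
```

In a real pipeline, extract would read from source systems and load would write to a warehouse or database, but the stage boundaries stay the same.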
What might be carried out during the transformation process of ETL?
- Data cleaning
- Data enrichment
- Feature engineering
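A small sketch of these transformation steps on a single record (the column names and values are made up for illustration):

```python
record = {"city": "  london", "price": "250000", "area_m2": "100"}

# Data cleaning: trim whitespace and normalise casing.
record["city"] = record["city"].strip().title()

# Type conversion so the values match the target schema.
record["price"] = int(record["price"])
record["area_m2"] = int(record["area_m2"])

# Feature engineering: derive a new attribute from existing ones.
record["price_per_m2"] = record["price"] / record["area_m2"]
print(record["price_per_m2"])  # 2500.0
```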
Where is ETL traditionally used?
Batch processing
What other approaches can be used instead of ETL?
- IoT Hubs
- Digital Twins
- Data Pipeline Orchestrators
- Bulk import tools
- Data Streaming Platforms
What are the requirements for a data integration tool?
- Different protocols to support data collection from different sources
- Support for integration processes across different hardware and operating systems
- Scalability and adaptability
- Integrated capabilities for transformation operations (including fundamental and complex transformations)
- Security mechanisms to protect the data in the pipeline
- Visualisation of data flow (not necessary, but offered by many tools)
What is data variety?
The diversity of data types
What is data veracity?
The level of trust in the collected data
What is meant by data velocity?
The speed of data generation and its movement
What are challenges with data integration and ingestion?
- The increasing variety and veracity of data
- Processing large amounts of data with high velocity
- New requirements due to increased interest in new technologies
- Cybersecurity – data needs to be secure, trusted and accountable
- Encryption to protect data during transport
How do data processing frameworks work?
- Distribute storage and processing over several nodes
- Transform data in several steps
- Can efficiently store, access and process large amounts of data
How is the transformation process generally modelled in a data processing framework?
As a Directed Acyclic Graph
How are Directed Acyclic Graphs used in data processing?
- Each stage (task) has an input and an output
- Outputs can be used as inputs to multiple tasks, so dependencies are clearly defined and, because the graph is acyclic, there are no feedback loops