Introduction Flashcards
(46 cards)
Data
Found everywhere, generated at an unprecedented rate. Fundamental component of society
Data in our lives
- Smart Home Devices
- Fitness Trackers (Track Biometric data)
- Financial Transactions
Types of Data
Structured, Unstructured, Semi-Structured
Structured Data
Well organized, formatted, and easily searchable, e.g. financial records. Usually stored in an RDBMS or in files such as CSV.
Unstructured Data
Unorganized and unformatted, and comes in many different formats, e.g. social media posts, emails. Usually stored in file systems or a CMS that preserve the original structure.
Semi-Structured Data
A combination of the two: partially organized data that doesn’t follow a rigid format but still has some level of structure. Mix of fixed and variable fields. Can be found in XML or JSON files.
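To make the "mix of fixed and variable fields" concrete, here is a minimal sketch using Python's standard `json` module. The records and field names are made up for illustration.

```python
import json

# Hypothetical API-style records: "id" and "name" appear in every record,
# while the remaining fields vary from record to record.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Cy"}
]
"""

records = json.loads(raw)

# Fixed fields are present in every record; variable fields only in some.
fixed_fields = set.intersection(*(set(r) for r in records))
variable_fields = set.union(*(set(r) for r in records)) - fixed_fields

print(sorted(fixed_fields))     # ['id', 'name']
print(sorted(variable_fields))  # ['email', 'phones']
```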
Qualitative Data example
Gender, Nationality
Quantitative Data example
Height, Weight
Raw Data
Original source of the data; hard to use for analysis. Raw data may only need to be processed once.
Processed Data
Data that is ready for analysis, processing can include merging, subsetting, transforming etc. All steps should be recorded.
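A rough sketch of merging, transforming, and subsetting while recording each step, in plain Python. The tables, field names, and the 1.70 m threshold are assumptions chosen only for illustration.

```python
# Hypothetical raw inputs: a small table of people and a lookup keyed by id.
people = [
    {"id": 1, "name": "Ana", "height_cm": "170"},
    {"id": 2, "name": "Ben", "height_cm": "182"},
    {"id": 3, "name": "Cy", "height_cm": "165"},
]
countries = {1: "Portugal", 2: "Canada", 3: "Kenya"}

steps = []  # record of every processing step, in order

# Merge: attach the country to each person by id.
merged = [{**p, "country": countries[p["id"]]} for p in people]
steps.append("merged people with countries on id")

# Transform: convert height from text in centimetres to a number in metres.
transformed = [
    {**row, "height_m": int(row["height_cm"]) / 100} for row in merged
]
steps.append("converted height_cm (text) to height_m (float)")

# Subset: keep only the rows and columns needed for the analysis.
processed = [
    {"name": r["name"], "country": r["country"], "height_m": r["height_m"]}
    for r in transformed
    if r["height_m"] >= 1.70
]
steps.append("kept rows with height_m >= 1.70 and dropped unused columns")

print(processed)
print(steps)
```

Keeping `steps` next to the output is one simple way to satisfy "all steps should be recorded"; a script under version control serves the same purpose.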
Raw Data Example
Machine-generated files (ASCII or binary), unformatted Excel files, API responses.
Accuracy
The measure of data quality that ensures data is correct, free from errors, and represents the real-world value accurately.
Completeness
Indicates whether all required data is recorded or if some is missing/unavailable.
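One possible way to put a number on completeness is the share of required fields that are actually filled in. This is a sketch of that idea, not a standard formula; the required fields and records are hypothetical.

```python
REQUIRED = ("name", "occupation", "gender")  # hypothetical required fields

records = [
    {"name": "Ana", "occupation": "Engineer", "gender": "F"},
    {"name": "Ben", "occupation": "", "gender": "M"},
    {"name": "Cy", "occupation": None, "gender": None},
]

def is_present(value):
    """Treat None and the empty string as missing."""
    return value not in (None, "")

# Count filled required fields across all records.
filled = sum(is_present(r.get(f)) for r in records for f in REQUIRED)
completeness = filled / (len(records) * len(REQUIRED))

print(f"{completeness:.0%}")  # 67%
```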
Consistency
Ensures uniformity across data. Examples of issues include partially modified records or dangling updates.
Timeliness
Refers to whether data is updated promptly to reflect the current state.
Believability
Assesses how trustworthy or credible the data is.
Interpretability
Reflects how easily the data can be understood by users.
Data Consolidation Process
- Moving data: Ensuring all data is gathered from different sources into a unified location.
- Making it consistent: Aligning data formats and resolving inconsistencies.
- Cleaning data: Removing errors, duplicates, and filling in missing values where possible.
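The three steps above can be sketched in plain Python. The two sources, their field names, and the "first occurrence wins" rule for duplicates are all assumptions made for the example.

```python
# Hypothetical sources describing the same customers in different shapes.
crm_rows = [
    {"customer_id": "001", "city": "Louisville", "active": "Y"},
    {"customer_id": "002", "city": "", "active": "N"},
]
billing_rows = [
    {"CustomerID": 1, "City": "Louisville", "Active": 1},
    {"CustomerID": 3, "City": "Austin", "Active": 0},
]

# 1. Move: gather both sources into one place, tagging each row's origin.
gathered = [("crm", r) for r in crm_rows] + [("billing", r) for r in billing_rows]

# 2. Make consistent: map both layouts onto a single schema.
def to_common(source, row):
    if source == "crm":
        return {
            "id": int(row["customer_id"]),
            "city": row["city"],
            "active": row["active"] == "Y",
        }
    return {
        "id": row["CustomerID"],
        "city": row["City"],
        "active": bool(row["Active"]),
    }

consistent = [to_common(source, row) for source, row in gathered]

# 3. Clean: drop duplicate ids (first occurrence wins) and mark blank
#    cities as None so missing values are explicit.
seen = set()
consolidated = []
for row in consistent:
    if row["id"] in seen:
        continue
    seen.add(row["id"])
    row["city"] = row["city"] or None
    consolidated.append(row)

print(consolidated)
```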
Why is Data Consolidation Needed?
- Data is often stored in different formats, making integration difficult.
- Data is frequently inconsistent across sources, causing discrepancies.
- Data may be dirty due to:
  - Internal inconsistencies (e.g., conflicting records).
  - Missing values or blank fields.
  - Potentially incorrect data caused by:
    - Faulty instruments.
    - Human or system errors.
    - Transmission errors during data movement.
Disparate Data
Data is often stored in diverse locations and formats, which may include:
- Relational Databases: Used in operational systems for structured data.
- XML Files: Common in web services for hierarchical data.
- Desktop Databases: Such as Microsoft Access.
- Spreadsheets: Examples include Microsoft Excel.
- JSON: Popular for semi-structured or API-related data.
Challenges with Disparate Data
- Data may reside on different operating systems.
- Databases may operate on varying hardware platforms.
- Integration of such varied data requires specialized tools and processes.
Causes of Inconsistencies
- Faulty instruments, human/computer errors, transmission errors, or business requirements.
- Examples of discrepancies:
  - Two plants use different part numbers for the same item.
  - Systems use different formats for True/False (e.g., 1/0, T/F, Y/N).
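A small sketch of reconciling the True/False formats mentioned above. The function name `to_bool` and the accepted spellings are assumptions; a real system would use whatever encodings its sources actually emit.

```python
# Accepted spellings, compared case-insensitively (illustrative, not exhaustive).
TRUE_VALUES = {"1", "t", "y", "true", "yes"}
FALSE_VALUES = {"0", "f", "n", "false", "no"}

def to_bool(value):
    """Map 1/0, T/F, or Y/N style flags onto a Python bool."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    # Fail loudly instead of guessing at an unknown encoding.
    raise ValueError(f"unrecognized boolean flag: {value!r}")

print([to_bool(v) for v in [1, "0", "T", "f", "Y", "N"]])
# [True, False, True, False, True, False]
```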
Other Data Quality Issues
- Free-form text entry:
  - Example: Same city entered as Louisville, Lewisville, and Luisville.
  - Cleaning routines must handle variations in bad data.
- Incomplete Data: Missing attribute values or only containing aggregate data.
  - Example: Occupation = “” or Gender = “Unknown”.
- Noisy Data: Contains noise, errors, or outliers.
  - Example: Salary = -10 (invalid).
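The three issues above can be checked with a short routine. This sketch uses `difflib.get_close_matches` from the standard library for the misspelled city. The reference list `KNOWN_CITIES`, the 0.75 cutoff, and the record layout are assumptions. Note that fuzzy matching can also "correct" a legitimately different place name, so matches usually deserve a review.

```python
import difflib

KNOWN_CITIES = ["Louisville", "Austin"]  # hypothetical reference list

def check_record(record):
    """Return a cleaned copy of the record plus a list of quality issues."""
    cleaned = dict(record)
    issues = []

    # Free-form text: snap a misspelled city to the closest known spelling.
    match = difflib.get_close_matches(
        record["city"], KNOWN_CITIES, n=1, cutoff=0.75
    )
    if match:
        cleaned["city"] = match[0]
    else:
        issues.append("unrecognized city")

    # Incomplete data: empty or placeholder values count as missing.
    for field in ("occupation", "gender"):
        if record[field] in ("", "Unknown"):
            cleaned[field] = None
            issues.append(f"missing {field}")

    # Noisy data: a negative salary cannot be valid.
    if record["salary"] < 0:
        cleaned["salary"] = None
        issues.append("invalid salary")

    return cleaned, issues

cleaned, issues = check_record(
    {"city": "Luisville", "occupation": "", "gender": "Unknown", "salary": -10}
)
print(cleaned)
print(issues)
```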
Causes of Missing Data
- Equipment malfunction.
- Deleted data due to inconsistencies.
- Misunderstandings or assumptions during data entry.
- Data not considered important at the time.
- Lack of historical records or change tracking.