Chapter 2 - Mastering the Data Flashcards
(66 cards)
Cleaning data
generally means filling out missing values/finding out rows with empty values or checking for duplicate records, etc.;
data transformation
converting data from one format to another. Likely moving from the format of the source system to the format of the destination system.
4 benefits of data preparation
- catch errors before the actual process begins
- produce better quality of data
- processing better data = better insights
- better insights = better decisions
What are the 4 characteristics of a data analytics mindset?
- Asking the right questions
- ETL (extract, transform, and load) relevant data
- Apply appropriate DA techniques
- Interpret and share results with stakeholders
A.E.A.I
A delimiter(sometimes known as a field separator)
is a sequence of one or more characters specifying the boundary between distinct data attributes. For example, if we write a name as “Smith, David,” then the comma delimits, or separates, the first and last names. Any combination of characters can be used as delimiters, but the most common are a comma, tab, space, colon and pipe (which is a vertical line, typed as |).
Proprietary file types
There are many different proprietary file types, but the most commonly used are .xls or .xlsx, the file types for saving Microsoft Excel documents. When proprietary file types save files, they use(underlying)coding to distinguish the rows and columns.
–Strength: The program “gets it right” when putting the data in the correct columns and rows.
–Weakness: Proprietary file types often cannot be opened in other software and the amount of records they hold can be restricted. For example, Excel files (currently) can hold approximately 1millionrows of data.
CSV
(comma-separated values)
TSV
(tab-separated values)
pipe
(|)
XBRL
eXtensible Business Reporting Language
is a freely available and global framework for exchanging business information. XBRL allows the expression of semantic meaning commonly required in business reporting.
text qualifier
When a delimiter (such as a comma) is used as a legitimate part of the text, you need to tell the program that it actually is part of the text. To do this, you need to add text qualifiers to the text. The most common text qualifier is the double quotation marks (“”).
Understand the importance of unique identifiers when dealing with data, and how they can save you time and prevent errors.
example: location number + employee ID numbers unique identifiers
5 main types of join
- inner join
- left join
- right join
- full outer join
- cross join
Inner join
The inner join(sometimes just called a join) combines data in two tables that matches one or moreidentified attribute. Importantly, an inner join will not pull the data from the tables if there is no match ofthe identified attribute
Left join - merging/joining data
The left join combines all data from the table listed on the left and only data that matches the identified attributes from the right table.
Right Join - merging/joining data
the right join functions similar to the left join, except it keeps all data in the right table and only merges matching data from the left table. A left join and right join will produce the exact same results if you switch which tables are listed on the left or right
full outer join
A full outer join returns all values from both tables when they match on a specified dimension, and then returns all values that do not match on that dimension with a null value for the non-matching fields.
Cross join (or Cartersian product)
A cross join (or Cartesian product) does not use any variable to match, rather it pairs every single instance in one table with every other instance of the other table
Aggregation
Aggregation is the level at which the data is summarized. It can be at a low level (no aggregation is used)or at a high level (data is aggregated into a single number).
Data formats
Data can be formatted in many ways. A format specifies how the data should be treated. Common formats include treating data as a number, text, percent, scientific notation, etc. It is important to understand all of the different formats used in data and what they mean.
Similarly, different data formats often don’t “speak with each other.” If a unique identifier of 1731 is listed as a number in one data set but the identical number 1731 is listed as a text string in another data set, the tables will not merge correctly until the formats are the same. Each program processes formats differently, so making sure you understand how your program deals with formats is important.
Common messy data problems
- Data formats
- Dates
- Duplicate and redundant data
- Units of measurement
- International differences
6.
Creating a repeatable ETL process
- Data format
- Data scope
- Documentation
- Automation
Mastering the data can also be described via the ETL process. The ETL process stands for __________.
- Enter, transform and load
- Enter, total and load
- Extract, transform and load
- Extract, total and load
3
Which of the following is not a common way that data will need to be cleaned after extraction and validation?
- Remove heading and subtotals
- Remove trailing zeros
- Correct inconsistencies across data
- Format negative numbers
2