Chapter 2 - Mastering the Data Flashcards

1
Q

Cleaning data

A

generally means filling in missing values, finding rows with empty values, checking for duplicate records, etc.

2
Q

data transformation

A

converting data from one format to another, typically from the format of the source system to the format of the destination system.

3
Q

4 benefits of data preparation

A
  1. catch errors before the actual process begins
  2. produce better quality of data
  3. processing better data = better insights
  4. better insights = better decisions
4
Q

What are the 4 characteristics of a data analytics mindset?

A
  1. Asking the right questions
  2. ETL (extract, transform, and load) relevant data
  3. Apply appropriate DA techniques
  4. Interpret and share results with stakeholders

A.E.A.I

5
Q

A delimiter (sometimes known as a field separator)

A

is a sequence of one or more characters specifying the boundary between distinct data attributes. For example, if we write a name as “Smith, David,” then the comma delimits, or separates, the first and last names. Any combination of characters can be used as delimiters, but the most common are a comma, tab, space, colon and pipe (which is a vertical line, typed as |).
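As a quick sketch, Python's built-in csv module can split a delimited line on any of these characters (the sample record is made up):

```python
import csv
import io

# The delimiter tells the parser where one field ends and the next begins.
# Here the pipe character separates four attributes in a hypothetical record.
line = "Smith|David|Accounting|2023-01-15"
reader = csv.reader(io.StringIO(line), delimiter="|")
fields = next(reader)
print(fields)  # ['Smith', 'David', 'Accounting', '2023-01-15']
```

Swapping `delimiter="|"` for `","` or `"\t"` handles CSV and TSV files the same way.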

6
Q

Proprietary file types

A

There are many different proprietary file types, but the most commonly used are .xls and .xlsx, the file types for saving Microsoft Excel documents. When proprietary file types save files, they use underlying coding to distinguish the rows and columns.

–Strength: The program “gets it right” when putting the data in the correct columns and rows.

–Weakness: Proprietary file types often cannot be opened in other software, and the number of records they hold can be restricted. For example, Excel files (currently) can hold approximately 1 million rows of data.

7
Q

CSV

A

(comma-separated values)

8
Q

TSV

A

(tab-separated values)

9
Q

pipe

A

(|)

10
Q

XBRL

A

eXtensible Business Reporting Language

is a freely available and global framework for exchanging business information. XBRL allows the expression of semantic meaning commonly required in business reporting.

11
Q

text qualifier

A

When a delimiter (such as a comma) is used as a legitimate part of the text, you need to tell the program that it actually is part of the text. To do this, you need to add text qualifiers to the text. The most common text qualifier is the double quotation marks (“”).
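A short sketch of how a parser honors text qualifiers, again using Python's csv module on a made-up record:

```python
import csv
import io

# The comma inside "Smith, David" is protected by double-quote text qualifiers,
# so the parser treats it as part of the field rather than a field boundary.
raw = '"Smith, David",Accounting,50000\n'
reader = csv.reader(io.StringIO(raw), delimiter=",", quotechar='"')
row = next(reader)
print(row)  # ['Smith, David', 'Accounting', '50000']
```

Without the quotation marks, the same line would split into four fields instead of three.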

12
Q

Understand the importance of unique identifiers when dealing with data, and how they can save you time and prevent errors.

A

example: when employee ID numbers repeat across locations, combining the location number with the employee ID number creates a unique identifier for each record
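A minimal sketch of the location number + employee ID idea, with hypothetical records:

```python
# Employee IDs repeat across locations, so neither column alone is unique;
# concatenating location number and employee ID yields a unique identifier.
records = [
    {"location": "01", "employee_id": "1001"},
    {"location": "02", "employee_id": "1001"},  # same ID, different location
]
for r in records:
    r["unique_id"] = f'{r["location"]}-{r["employee_id"]}'

ids = [r["unique_id"] for r in records]
print(ids)  # ['01-1001', '02-1001']
assert len(ids) == len(set(ids))  # no duplicates
```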

13
Q

5 main types of join

A
  1. inner join
  2. left join
  3. right join
  4. full outer join
  5. cross join
14
Q

Inner join

A

The inner join (sometimes just called a join) combines data in two tables that match on one or more identified attributes. Importantly, an inner join will not pull data from the tables if there is no match on the identified attribute.

15
Q

Left join - merging/joining data

A

The left join combines all data from the table listed on the left and only data that matches the identified attributes from the right table.

16
Q

Right Join - merging/joining data

A

The right join functions similarly to the left join, except it keeps all data in the right table and only merges matching data from the left table. A left join and a right join will produce the exact same results if you switch which tables are listed on the left and right.

17
Q

full outer join

A

A full outer join returns all values from both tables when they match on a specified dimension, and then returns all values that do not match on that dimension with a null value for the non-matching fields.

18
Q

Cross join (or Cartesian product)

A

A cross join (or Cartesian product) does not use any variable to match; rather, it pairs every single record in one table with every record in the other table.
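The five join types above can be sketched with pandas on two hypothetical toy tables:

```python
import pandas as pd

left = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
right = pd.DataFrame({"emp_id": [2, 3, 4], "dept": ["Audit", "Tax", "IT"]})

inner = left.merge(right, on="emp_id", how="inner")    # only matches: emp_id 2, 3
left_j = left.merge(right, on="emp_id", how="left")    # all of left; dept is NaN for 1
right_j = left.merge(right, on="emp_id", how="right")  # all of right; name is NaN for 4
outer = left.merge(right, on="emp_id", how="outer")    # everything: emp_id 1, 2, 3, 4
cross = left.merge(right, how="cross")                 # every pairing: 3 x 3 = 9 rows

print(len(inner), len(left_j), len(right_j), len(outer), len(cross))  # 2 3 3 4 9
```

Note `how="cross"` requires pandas 1.2 or later.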

19
Q

Aggregation

A

Aggregation is the level at which the data is summarized. It can be at a low level (no aggregation is used) or at a high level (data is aggregated into a single number).

20
Q

Data formats

A

Data can be formatted in many ways. A format specifies how the data should be treated. Common formats include treating data as a number, text, percent, scientific notation, etc. It is important to understand all of the different formats used in data and what they mean.

Similarly, different data formats often don’t “speak with each other.” If a unique identifier of 1731 is listed as a number in one data set but the identical number 1731 is listed as a text string in another data set, the tables will not merge correctly until the formats are the same. Each program processes formats differently, so making sure you understand how your program deals with formats is important.
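A quick pandas sketch of the 1731 example (tables are hypothetical): the merge only works once the key's format is aligned.

```python
import pandas as pd

# The same identifier 1731 stored as a number in one table and as a
# text string in the other will not merge correctly as-is.
a = pd.DataFrame({"id": [1731], "amount": [99.50]})     # id stored as a number
b = pd.DataFrame({"id": ["1731"], "region": ["West"]})  # id stored as text

# Fix: convert the text key to the numeric type before merging.
b["id"] = b["id"].astype(a["id"].dtype)
good = a.merge(b, on="id", how="inner")
print(len(good))  # 1
```

Depending on the pandas version, merging mismatched types either raises an error or silently returns no matches, so aligning formats first is the safe habit.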

21
Q

Common messy data problems

A
  1. Data formats
  2. Dates
  3. Duplicate and redundant data
  4. Units of measurement
  5. International differences
22
Q

Creating a repeatable ETL process

A
  1. Data format
  2. Data scope
  3. Documentation
  4. Automation
23
Q

Mastering the data can also be described via the ETL process. The ETL process stands for __________.

  1. Enter, transform and load
  2. Enter, total and load
  3. Extract, transform and load
  4. Extract, total and load
A

3

24
Q

Which of the following is not a common way that data will need to be cleaned after extraction and validation?

  1. Remove heading and subtotals
  2. Remove trailing zeros
  3. Correct inconsistencies across data
  4. Format negative numbers
A

2

25
Q

What is the purpose of transforming data?

  1. Validate the data for completeness and integrity
  2. Identify which data are necessary to complete the analysis
  3. Load the data into the appropriate tool for analysis
  4. Obtain the data from the appropriate sources
A

3

26
Q

Which of these is not included in the five steps of the ETL process?

  1. Scrub the data
  2. Validate the data for completeness and integrity
  3. Determine the purpose and scope of the data request
  4. Obtain the data
A

1

27
Q

What are attributes that exist in a relational database that are neither primary nor foreign keys?

  1. Relational table attributes
  2. Composite key
  3. Descriptive attributes
  4. Nondescript attributes
A

3

28
Q

The advantages of storing data in a relational database include which of the following?

a - Help in enforcing business rules
b - Increased information redundancy
c - Integrating business processes

  1. All possible choices
  2. Only B and C
  3. Only A and B
  4. Only A and C
A

4

29
Q

Which of the following is the metadata that describes each attribute in a database?

  1. Data dictionary
  2. Flat file
  3. Composite primary key
  4. Descriptive attributes
A

1

30
Q

Which attribute is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table?

  1. Key attribute
  2. Foreign key
  3. Primary key
  4. Unique identifier
A

3

31
Q

In data preparation, which step includes extracting the information from any source?

  1. Transform
  2. Gather
  3. Discover
  4. Enrich
A

2

32
Q

Which delimiter does the AICPA recommend in their Audit Data Standards?

  1. Pipe or vertical line
  2. Space
  3. Tab
  4. Comma
A

1

33
Q

6 Steps in data preparation

(according to food analogy article)

A
  1. Gather - extracting info from any source
  2. Discover - meaning/understanding the data
  3. Cleanse - missing values, outliers
  4. Transform - changing data format
  5. Enrich - enhancing raw data to improve insights
  6. Store - store or send for analysis

GDCTES (great days come to each son)

34
Q

composite primary key

A

A special case of a primary key that exists in linking tables. The composite primary key is made up of the primary keys of the two tables that it links.

35
Q

data dictionary

A

Centralized repository of descriptions for all of the data attributes of a dataset.

36
Q

Data request form

A

A method for obtaining data if you do not have access to obtain the data directly yourself

37
Q

Descriptive (or nonkey) attributes

A

Attributes that exist in relational databases that are neither primary nor foreign keys. These attributes provide business information, but are not required to build a database. An example would be “Company Name” or “Employee Address.”

38
Q

ETL

A

Extract, transform and load process integral to mastering the data

39
Q

Flat file

A

A means of storing data in one place, such as in an Excel spreadsheet, as opposed to storing the data in multiple tables, such as in a relational database

40
Q

Foreign Key

A

An attribute that exists in relational databases in order to carry out the relationship between two tables. This does not serve as the “unique identifier” for each record in a table. These must be identified when mastering the data from a relational database in order to extract the data correctly from more than one table.

The foreign key is another type of attribute, and its function is to create the relationship between two tables

41
Q

Mastering the data

A

The second step in the IMPACT cycle; it involves identifying and obtaining the data needed for solving the data analysis problem, as well as cleaning and preparing the data for analysis.

42
Q

Primary key

A

An attribute that is required to exist in each table of a relational database and serves as the “unique identifier” for each record in a table

43
Q

Relational Database

A

A means of storing data in order to ensure that the data are complete, not redundant, and to help enforce business rules. Relational databases also aid in communication and integration of business processes across an organization.

Relational databases are made up of tables with uniquely identified records (via primary keys) and are related through the use of foreign keys.

44
Q

What are the 5 steps of the ETL process?

A
  1. Step 1 Determine the purpose and scope of the data request (extract).
  2. Step 2 Obtain the data (extract).
  3. Step 3 Validate the data for completeness and integrity (transform).
  4. Step 4 Sanitize the data (transform).
  5. Step 5 Load the data in preparation for data analysis (load).
45
Q

RDBMS acronym and 3 examples

A

Relational Database Management Systems

  1. Microsoft Access
  2. SQLite
  3. Microsoft SQL Server
46
Q

Microsoft Access

A

For any user of Microsoft products (Word, Excel, PowerPoint, etc.), the navigation of Microsoft Access is familiar, so it is a relatively easy entry point for working with relational databases. It is a great entry tool for learning how tables are related via primary and foreign keys, because entire databases can be built via a graphical user interface instead of having to use SQL statements to create tables and relationships.

47
Q

SQLite

A

SQLite is an open-source solution for data management. For a user who is at least somewhat familiar with relational database management, it is a friendly tool and presents an intuitive interface for writing SQL statements.

48
Q

open-source

A

denoting software for which the original source code is made freely available and may be redistributed and modified.

49
Q

Microsoft SQL Server

A

Microsoft SQL Server can support enterprise-level data in ways that smaller RDBMS programs, such as Access and SQLite, cannot.

It is meant to provide experience that replicates working with the much larger and more complex datasets you will likely encounter in the professional world.

50
Q

Other examples of RDBMS

A
  1. Teradata
  2. MySQL
  3. Oracle RDBMS
  4. IBM DB2
  5. Amazon RDS
  6. PostgreSQL
51
Q

Unified Modeling Language (UML) Class Diagram

A

is an illustration or a drawing of the tables and their relationships to each other (i.e., a database schema)

52
Q

Benefits of a normalized relational database over a flat file (4):

A
  1. Completeness
  2. No redundancy
  3. Business rules are enforced
  4. Communication and integration of business processes
53
Q

What is meant by these 4 benefits of using normalized relational databases over flat files

  1. Completeness
  2. No redundancy
  3. Business rules are enforced
  4. Communication and integration of business processes
A
  1. Completeness - ensures all data required for a business process are included in the dataset
  2. No redundancy - one version of the truth; the redundancy found in flat files takes up unnecessary space (which is expensive) and increases the risk of data-entry errors
  3. Business rules are enforced - allows for better placement and enforcement of internal controls in ways that flat files simply cannot
  4. Communication and integration of business processes - the design of relational databases supports business processes, which results in improved communication across functional areas
54
Q

What are the three types of columns in a relational database table?

A
  1. Primary Key (PK)
  2. Foreign Key (FK)
  3. Descriptive Attributes
55
Q

Procure-to-pay

A

Procure to pay is the process of requisitioning, purchasing, receiving, paying for and accounting for goods and services.

56
Q

ADS developed by the AICPA

A

Audit Data Standards

While the ADSs provide an opportunity for standardization, they are voluntary

57
Q

What is SQL? what are its uses and specific use related to DA?

A

SQL stands for Structured Query Language.

It is a computer language that can be used to create, update, and delete records and tables in relational databases. In data analysis, however, the focus is on extracting data—that is, selecting the precise attributes and records that fit the criteria of our data analysis goal.
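A minimal, self-contained sketch using Python's built-in sqlite3 module (table names and data are hypothetical): a primary key in each table, a foreign key linking them, and a SELECT that extracts only the attributes of interest.

```python
import sqlite3

# In-memory relational database: Customers (PK customer_id) and Orders
# (PK order_id, FK customer_id referencing Customers).
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES Customers(customer_id),
        total REAL
    );
    INSERT INTO Customers VALUES (1, 'Acme'), (2, 'Blake LLC');
    INSERT INTO Orders VALUES (10, 1, 250.0), (11, 1, 75.5), (12, 2, 40.0);
""")

# Extraction: join on the PK/FK relationship and pull only what the
# analysis question needs (total sales per customer).
cur.execute("""
    SELECT c.name, SUM(o.total)
    FROM Customers c
    JOIN Orders o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""")
rows = cur.fetchall()
print(rows)  # [('Acme', 325.5), ('Blake LLC', 40.0)]
```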

58
Q

Important Appendices in the Book

A

Appendices:

D & H for SQL - writing queries and creating joins

C for Excel VLOOKUP

59
Q

What is the Vlookup function in Excel and what is it used for?

A

One of Excel's most useful tools for looking up data from two separate tables and matching them based on a primary key/foreign key relationship is the VLOOKUP function. There are a variety of ways the VLOOKUP function can be used, but for extracting and transforming data it is best used to add a column to a table.
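As a rough sketch of what VLOOKUP does (matching a key against a lookup table to add a column), a Python dictionary lookup behaves similarly; the tables here are made up:

```python
# Lookup table: key column (dept_id) -> return column (department name).
departments = {"D1": "Audit", "D2": "Tax"}

employees = [
    {"name": "Smith, David", "dept_id": "D1"},
    {"name": "Jones, Ann", "dept_id": "D2"},
    {"name": "Lee, Kim", "dept_id": "D9"},  # no match in the lookup table
]

# VLOOKUP-style enrichment: add a dept_name column by matching on dept_id;
# like VLOOKUP, an unmatched key yields #N/A.
for e in employees:
    e["dept_name"] = departments.get(e["dept_id"], "#N/A")

print([e["dept_name"] for e in employees])  # ['Audit', 'Tax', '#N/A']
```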

60
Q

When to use SQL vs. VLookup for ET?

A

SQL - pulling out specific information of interest to answer your biz question

VLookup - for exploratory analysis where you don’t mind pulling out the entire table of data. Note: Excel has the 1M row limit and it can get super slow.

61
Q

What 4 steps help validate data after Extraction?

A
  1. Compare the # of records that were extracted to the # of records in the source database.
  2. Compare descriptive statistics for numeric fields (calculating min, max, avg, medians - help ensure numeric data were extracted completely)
  3. Validate Date/Time Fields - converting to numeric and running descriptive statistics
  4. Compare string limits for text fields - ensure you haven't cut off any characters
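The first two validation steps can be sketched in a few lines of Python (the extract and the source row count are hypothetical):

```python
import statistics

# Step 1: compare the number of extracted records to the source system's count.
source_row_count = 5
extracted = [120.0, 89.5, 301.25, 44.0, 97.0]  # a hypothetical numeric field
assert len(extracted) == source_row_count

# Step 2: descriptive statistics for the numeric field, to be compared
# against the same statistics computed in the source database.
print(min(extracted), max(extracted))        # 44.0 301.25
print(round(statistics.mean(extracted), 2))  # 130.35
print(statistics.median(extracted))          # 97.0
```

Steps 3 and 4 follow the same pattern: convert date/time fields to numbers and run the same statistics, and compare maximum string lengths for text fields.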
62
Q

Descriptive Statistics

Central Tendency

Variability

A

Descriptive statistics is distinguished from inferential statistics (or inductive statistics) by its aim to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent.

Measures of central tendency include the mean, median, and mode.

Measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis, and skewness.
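A minimal sketch of these measures in Python with pandas (the sample values are made up):

```python
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Central tendency
print(data.mean())           # 5.0
print(data.median())         # 4.5
print(data.mode().tolist())  # [4]

# Variability
print(data.min(), data.max())  # 2 9
print(round(data.std(), 3))    # sample standard deviation
print(round(data.skew(), 3))   # skewness
print(round(data.kurt(), 3))   # excess kurtosis
```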

63
Q

Kurtosis

A

In probability theory and statistics, kurtosis (from Greek: κυρτός, kyrtos or kurtos, meaning “curved, arching”) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution and there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. Different measures of kurtosis may have different interpretations.

Pearson kurtosis - higher kurtosis corresponds to greater extremity of deviations (or outliers), not to the configuration of data near the mean.

Moors' interpretation of kurtosis: defining the standardized variable Z = (RV − Mean)/SD, where RV = random variable, kurtosis measures the dispersion of Z² around its expected value of 1, so observations far from the mean drive it up.

64
Q

Skewness

A

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right.

A left-skewed distribution usually appears as a right-leaning curve; a right-skewed distribution usually appears as a left-leaning curve.

The mass of the distribution is generally concentrated on the side opposite the name, since the terms left skew and right skew refer to where the tail is.

Sometimes approximated nonparametrically as:

(Mean-Median)/(SD)

65
Q

What are 4 common ways data has to be cleaned after extraction and validation?

A
  1. Remove headings or subtotals
  2. Clean leading zeros and nonprintable characters - happens when numbers are stored in the source DB as text and need to be analyzed as numbers
  3. Format negative numbers
  4. Correct inconsistencies across data, in general
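Two of these cleanup steps can be sketched in Python (the raw values are hypothetical):

```python
# Strip leading zeros from IDs that were stored as text in the source system.
raw_ids = ["000123", "007", "42"]
clean_ids = [s.lstrip("0") or "0" for s in raw_ids]  # keep "0" for all-zero strings
print(clean_ids)  # ['123', '7', '42']

# Normalize formatted numbers: remove thousands separators and convert
# accounting-style negatives like (500.00) to -500.0.
def to_number(s):
    s = s.replace(",", "").strip()
    if s.startswith("(") and s.endswith(")"):
        return -float(s[1:-1])
    return float(s)

raw_amounts = ["1,250.00", "(500.00)", "75"]
print([to_number(s) for s in raw_amounts])  # [1250.0, -500.0, 75.0]
```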
66
Q

5 main quality issues to look for when analyzing a data set for the first time:

A
  1. Dates - main issue is format - preferred format is yyyy-mm-dd (ISO 8601) (from big y to small d)
  2. Numbers - 1 vs. I, 0 vs. O, 7 vs. seven, $, etc. Remove any extra characters to leave the raw number
  3. International characters and encoding - ASCII vs. Unicode, invisible computer characters (tabs, returns, line breaks, etc.)
  4. Languages and measures - cheese vs. fromage, pounds or lbs, dollar vs. euro
  5. Human error
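A sketch of normalizing mixed date formats to ISO 8601 (yyyy-mm-dd) with Python's standard library; the input formats listed are assumptions, and real data may need more patterns:

```python
from datetime import datetime

# Candidate input formats, tried in order; ambiguous data (e.g. 03/04/2023)
# needs a documented convention before choosing patterns.
patterns = ["%m/%d/%Y", "%d.%m.%Y", "%Y-%m-%d"]

def to_iso(value):
    """Return the date as yyyy-mm-dd, trying each known format in turn."""
    for p in patterns:
        try:
            return datetime.strptime(value, p).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {value}")

raw = ["3/14/2023", "14.03.2023", "2023-03-14"]
print([to_iso(d) for d in raw])  # ['2023-03-14', '2023-03-14', '2023-03-14']
```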