5 Data Wrangling and Manipulation Flashcards
What is the primary focus of a data analyst’s work?
Preparing data for use
What is data wrangling?
The process of cleaning and shaping data into a specific format
What does the term ‘manipulation’ mean in the context of data?
Handling and managing data in a skillful way
List the main topics covered in the data-wrangling process.
- Merging data
- Calculating derived and reduced variables
- Parsing your data
- Recoding variables
- Shaping data with common functions
What is a key variable?
A variable that is present in both tables being merged, allowing rows to be matched
What is an inner join?
A join that keeps only the rows whose key values are found in both of the original tables
True or False: Inner joins include all data points from both tables.
False
What is an outer join?
A join that includes every data point from both tables, regardless of matches
What is a left join?
A join that contains all data points from the left table and matching values from the right table
What is a right join?
A join that contains all data points from the right table and matching values from the left table
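The four join types above can be compared side by side. This is a minimal sketch that assumes pandas as the tool (the cards do not name one); the tables `customers` and `orders` and the key variable `customer_id` are hypothetical.

```python
import pandas as pd

# Hypothetical tables that share the key variable "customer_id".
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "total": [20.0, 35.5, 12.0]})

# The `how` argument of DataFrame.merge selects the join type.
inner = customers.merge(orders, on="customer_id", how="inner")  # keys in both tables
outer = customers.merge(orders, on="customer_id", how="outer")  # keys in either table
left = customers.merge(orders, on="customer_id", how="left")    # every key from customers
right = customers.merge(orders, on="customer_id", how="right")  # every key from orders

for label, table in [("inner", inner), ("outer", outer), ("left", left), ("right", right)]:
    print(label, sorted(table["customer_id"]))
```

In this toy data the inner join keeps keys 2 and 3, the outer join keeps 1 through 4, the left join keeps 1 through 3, and the right join keeps 2 through 4.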
What is data blending?
A temporary link between two tables through a left join without creating a new table
What is the difference between concatenation and appending?
Concatenation joins two or more sets of data end to end into one series; appending adds a single new value to the end of an existing series
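The distinction can be sketched in a few lines. The example assumes pandas for concatenation and a plain Python list for appending; the quarterly figures in `q1` and `q2` are made up for illustration.

```python
import pandas as pd

# Hypothetical values for two periods.
q1 = pd.Series([100, 120, 90])
q2 = pd.Series([110, 130])

# Concatenation: join two series end to end into one longer series.
combined = pd.concat([q1, q2], ignore_index=True)

# Appending: add one new value to the end of an existing series.
# A plain Python list is used here because list.append does exactly this.
values = q1.tolist()
values.append(95)

print(combined.tolist())  # [100, 120, 90, 110, 130]
print(values)             # [100, 120, 90, 95]
```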
Define derived variables.
Variables generated based on observed data using logic
What are metrics in the context of derived variables?
Derived variables that calculate a number to gauge the status of a data point
What are flags in the context of derived variables?
Categorical variables that summarize the status of another variable or data point
Fill in the blank: The majority of Key Performance Indicators (KPIs) are _______.
metrics
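A metric and a flag can both be derived from the same observed columns. The sketch below assumes pandas; the `sales` table, the margin formula, and the "profitable" rule are hypothetical choices made only to illustrate the two kinds of derived variable.

```python
import pandas as pd

# Hypothetical observed data.
sales = pd.DataFrame({"revenue": [200.0, 150.0, 80.0], "cost": [120.0, 150.0, 100.0]})

# Metric: a derived number that gauges the status of each data point.
sales["margin"] = (sales["revenue"] - sales["cost"]) / sales["revenue"]

# Flag: a derived categorical variable that summarizes that status.
sales["profitable"] = sales["margin"] > 0

print(sales)
```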
What is the consequence of using both derived variables and the variables used to calculate them in an analytical model?
It can cause multicollinearity
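One way to see the problem is to check the rank of the predictor matrix. This sketch assumes NumPy and uses made-up columns `a` and `b`; the derived variable `total` is an exact sum, which is the extreme case (perfect multicollinearity).

```python
import numpy as np

# Hypothetical observed variables.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 0.0, 5.0])

# Derived variable computed from the two observed ones.
total = a + b

X_sources = np.column_stack([a, b])
X_all = np.column_stack([a, b, total])

# Adding the derived column adds a predictor but no new information:
# the matrix has three columns, yet its rank stays at two.
rank_sources = np.linalg.matrix_rank(X_sources)
rank_all = np.linalg.matrix_rank(X_all)
print(X_all.shape[1], rank_sources, rank_all)  # 3 2 2
```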
What is the purpose of a key table in a database schema?
To store key variables that allow for merging of other tables
Which join type is the most conservative, producing the smallest final table?
Inner join
Explain the concept of pairwise deletion.
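A missing-data technique that excludes a record only from the calculations that need the value it is missing, rather than discarding the whole record; used in conjunction with outer joins to maximize data use from small datasets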
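The difference between dropping whole rows (listwise deletion) and pairwise deletion shows up in how many rows each calculation can use. The sketch assumes pandas, whose `DataFrame.corr` excludes missing values pair by pair; the table `df` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical small dataset with scattered missing values,
# as an outer join might produce.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan],
    "y": [2.0, 4.0, 6.0, np.nan, 10.0],
    "z": [1.0, np.nan, 2.0, 5.0, 4.0],
})

# Listwise deletion: any row with a missing value is dropped entirely.
listwise_n = len(df.dropna())

# Pairwise deletion: each pair of columns uses every row where both are present.
present = df.notna().astype(int)
pairwise_n = present.T @ present   # rows available for each column pair
pairwise_corr = df.corr()          # pandas drops missing values pair by pair

print(listwise_n)                # 2
print(pairwise_n.loc["x", "y"])  # 3
```

Here only two complete rows survive listwise deletion, while every pair of columns still has three usable rows under pairwise deletion.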
What does parsing data involve?
Breaking down chunks of text into usable formats
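As an illustration, a chunk of text can be parsed into typed fields with a regular expression. The log-style string `raw` and its layout are hypothetical; real text usually needs a pattern written for its own format.

```python
import re

# Hypothetical chunk of text holding three pieces of information.
raw = "2024-03-15 | order #1042 | total: $59.90"

# One capture group per field: date, order number, amount.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) \| order #(\d+) \| total: \$(\d+\.\d{2})")
match = pattern.fullmatch(raw)

# Convert each captured piece of text into a usable type.
record = {
    "date": match.group(1),
    "order_id": int(match.group(2)),
    "total": float(match.group(3)),
}
print(record)
```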
What is recoding in the data-wrangling process?
The process of changing variable values for clarity or analysis purposes
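A simple recode maps each original value to a clearer one. This sketch assumes pandas; the survey answers in `responses` and the mapping in `recode_map` are hypothetical.

```python
import pandas as pd

# Hypothetical survey answers with inconsistent casing.
responses = pd.Series(["Y", "N", "y", "N", "Y"])

# Recode the terse codes into readable labels.
recode_map = {"Y": "yes", "N": "no"}
recoded = responses.str.upper().map(recode_map)

print(recoded.tolist())  # ['yes', 'no', 'yes', 'no', 'yes']
```

Any value missing from the mapping becomes a null under `Series.map`, which makes unexpected codes easy to spot.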
What is the visual representation of an inner join?
A Venn diagram showing the overlap between two datasets
What is a common characteristic of outer joins?
They can produce many null values when data points do not match
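The nulls can be counted directly after the join. The sketch assumes pandas and two hypothetical tables that share only one value of the key variable `id`; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables that overlap on a single key value (id 3).
scores = pd.DataFrame({"id": [1, 2, 3], "score": [88, 92, 75]})
grades = pd.DataFrame({"id": [3, 4, 5], "grade": ["C", "B", "A"]})

# Outer join: every key from both tables is kept.
merged = scores.merge(grades, on="id", how="outer", indicator=True)

# Unmatched rows carry nulls in the columns from the other table.
null_counts = merged[["score", "grade"]].isna().sum()
matched_rows = (merged["_merge"] == "both").sum()

print(len(merged), matched_rows)  # 5 1
print(null_counts.to_dict())      # {'score': 2, 'grade': 2}
```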