5 Data Wrangling and Manipulation Flashcards

1
Q

What is the primary focus of a data analyst’s work?

A

Preparing data for use

2
Q

What is data wrangling?

A

The process of cleaning and shaping data into a specific format

3
Q

What does the term ‘manipulation’ mean in the context of data?

A

Handling and managing data in a skillful way

4
Q

List the main topics covered in the data-wrangling process.

A
  • Merging data
  • Calculating derived and reduced variables
  • Parsing your data
  • Recoding variables
  • Shaping data with common functions
5
Q

What is a key variable?

A

A variable that is present in both tables being merged, allowing rows to be matched

6
Q

What is an inner join?

A

A join that includes only the rows whose key values are found in both of the original tables

7
Q

True or False: Inner joins include all data points from both tables.

A

False

8
Q

What is an outer join?

A

A join that includes every data point from both tables, regardless of matches

9
Q

What is a left join?

A

A join that contains all data points from the left table and matching values from the right table

10
Q

What is a right join?

A

A join that contains all data points from the right table and matching values from the left table
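
The four join types on the preceding cards can be compared side by side in a minimal Python sketch. The two tables, the `runner_id` key variable, and the `join` helper are all hypothetical and exist only for illustration; real projects would normally rely on a database or a data-frame library instead.

```python
# Hypothetical tables that share the key variable runner_id.
left = {1: "Ana", 2: "Ben", 3: "Cy"}          # runner_id -> name
right = {2: "5K", 3: "10K", 4: "Marathon"}    # runner_id -> race

def join(left, right, how):
    """Return rows of (key, left_value, right_value) for one join type."""
    if how == "inner":
        keys = left.keys() & right.keys()     # keys found in both tables
    elif how == "outer":
        keys = left.keys() | right.keys()     # every key from either table
    elif how == "left":
        keys = left.keys()                    # all keys from the left table
    elif how == "right":
        keys = right.keys()                   # all keys from the right table
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None where a table has no match (a null value).
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]

inner_rows = join(left, right, "inner")
outer_rows = join(left, right, "outer")
left_rows = join(left, right, "left")
right_rows = join(left, right, "right")
```

In this sketch the inner join yields the fewest rows and the outer join the most, with `None` standing in for the nulls that appear wherever a key has no match.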

11
Q

What is data blending?

A

A temporary link between two tables through a left join without creating a new table

12
Q

What is the difference between concatenation and appending?

A

Concatenation merges data in a series; appending adds a new value to the end of an existing series
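
One way to picture the distinction, assuming plain Python lists stand in for a series (the weekly distances are made-up values):

```python
# Hypothetical series of run distances in kilometres.
week1 = [3.0, 4.5]
week2 = [5.0, 6.2]

# Concatenation: merge two series end to end into a new series.
combined = week1 + week2

# Appending: add one new value to the end of an existing series.
week2.append(7.1)
```

Here `combined` is a new list, while `append` changes `week2` in place.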

13
Q

Define derived variables.

A

Variables generated based on observed data using logic

14
Q

What are metrics in the context of derived variables?

A

Derived variables that calculate a number to gauge the status of a data point

15
Q

What are flags in the context of derived variables?

A

Categorical variables that summarize the status of another variable or data point

16
Q

Fill in the blank: The majority of Key Performance Indicators (KPIs) are _______.

A

metrics

17
Q

What is the consequence of using both derived variables and the variables used to calculate them in an analytical model?

A

It can cause multicollinearity

18
Q

What is the purpose of a key table in a database schema?

A

To store key variables that allow for merging of other tables

19
Q

What is the most conservative join type that results in a smaller final table?

A

Inner join

20
Q

Explain the concept of pairwise deletion.

A

A technique used in conjunction with outer joins to maximize data use from small datasets

21
Q

What does parsing data involve?

A

Breaking down chunks of text into usable formats

22
Q

What is recoding in the data-wrangling process?

A

The process of changing variable values for clarity or analysis purposes

23
Q

What is the visual representation of an inner join?

A

A Venn diagram showing the overlap between two datasets

24
Q

What is a common characteristic of outer joins?

A

They can produce many null values when data points do not match

25
How is Speed calculated?
Speed is calculated by dividing Distance by Time.
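As a worked example, Speed can be computed as a derived variable from the observed Distance and Time columns. The field names, units, and values below are assumptions made for this sketch.

```python
# Hypothetical training log: distance in km, time in hours.
runs = [
    {"distance_km": 5.0, "time_h": 0.5},
    {"distance_km": 10.0, "time_h": 1.25},
]

# Derive Speed = Distance / Time for every run.
for run in runs:
    run["speed_kmh"] = run["distance_km"] / run["time_h"]

speeds = [run["speed_kmh"] for run in runs]
```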
26
What does the Speed variable represent in the context of training for a race?
The Speed variable is a KPI that shows performance and progress towards a goal.
27
What are metrics in data analytics?
Metrics are derived variables that show quantitative data.
28
What are flags in data analytics?
Flags are derived variables that show qualitative data.
29
What are flags used for in data analytics?
Flags are categorical variables that summarize the status of another variable or data point.
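A flag might be derived from a metric as in the sketch below. The speeds, the `GOAL_KMH` threshold, and the label text are hypothetical values picked for illustration.

```python
# Hypothetical speed metric (km/h) for three training runs.
speeds_kmh = [10.0, 8.0, 9.5]

# Assumed goal pace; any run at or above it counts as on pace.
GOAL_KMH = 9.0

# The flag summarizes each speed as one of two categories.
pace_flags = ["on pace" if s >= GOAL_KMH else "behind" for s in speeds_kmh]
```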
30
What are reduction variables?
Reduction variables, or aggregate variables, reduce the volume of data by summarizing multiple variables.
31
List some basic methods of aggregation.
* Average * Sum * Maximum * Minimum * Count * Distinct Count
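Each of these aggregations reduces a column to a single value. A short sketch using only the Python standard library, with made-up finishing times in minutes:

```python
from statistics import mean

# Hypothetical finishing times in minutes.
times = [31, 29, 31, 30]

summary = {
    "average": mean(times),
    "sum": sum(times),
    "maximum": max(times),
    "minimum": min(times),
    "count": len(times),
    "distinct_count": len(set(times)),  # duplicates collapse in a set
}
```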
32
What is parsing in data analytics?
Parsing is breaking a single large piece of data into several smaller pieces that can be easily identified and processed.
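For instance, a single delimited record can be parsed into named pieces. The record layout (pipe-separated name, date, and distance) is an assumption made for this sketch.

```python
# Hypothetical raw record: name|date|distance
record = "Ana Lopez|2024-03-04|5.0km"

# Split the record into its three fields.
name, date_text, distance_text = record.split("|")

# Break the fields down further into usable values.
first_name, last_name = name.split(" ")
distance_km = float(distance_text.replace("km", ""))
```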
33
What is tokenization in Natural Language Processing (NLP)?
Tokenization is the process of breaking up text into words, with each becoming its own object or token.
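A very simple tokenizer can be sketched with a regular expression. This is only one possible approach; NLP libraries use more elaborate rules, and the pattern below (words, plus each punctuation mark as its own token) is an assumption of this example.

```python
import re

text = "This is a sentence?"

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)
```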
34
What is recoding in data analytics?
Recoding is turning variables into a different format, such as translating quantitative variables into qualitative variables.
35
How can numeric variables be recoded into categories?
Numeric variables can be recoded into categories based on ranges.
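A range-based recode might look like the following sketch. The cut-offs of 8 and 10 km/h and the category labels are hypothetical; the source does not specify them.

```python
def speed_category(speed_kmh):
    """Recode a numeric speed into a category using assumed ranges."""
    if speed_kmh < 8.0:
        return "slow"
    elif speed_kmh < 10.0:
        return "average"
    else:
        return "fast"

# Hypothetical speeds in km/h.
speeds = [7.2, 8.0, 9.9, 10.0]
categories = [speed_category(s) for s in speeds]
```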
36
What is dummy coding?
Dummy coding creates a new binary variable for every possible category in the original variable.
37
Why should you drop one dummy-coded variable from a model?
Dropping one dummy-coded variable prevents multicollinearity by avoiding perfect prediction among variables.
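Dummy coding and the dropped reference category can be sketched in plain Python. The `surface` variable and its values are invented for this example, and dropping the alphabetically first category is an arbitrary choice.

```python
# Hypothetical categorical variable.
surfaces = ["road", "trail", "track", "road"]

# One binary variable per category.
categories = sorted(set(surfaces))
dummies = [
    {f"surface_{c}": int(s == c) for c in categories}
    for s in surfaces
]

# Drop one category as the reference level; it is implied when
# all remaining dummy variables are 0.
reference = f"surface_{categories[0]}"
reduced = [
    {name: value for name, value in row.items() if name != reference}
    for row in dummies
]
```

Before the drop, the dummy variables in every row sum to 1, so any one of them is perfectly predicted by the others; removing the reference column breaks that dependency.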
38
What is the challenge with date variables in data analytics?
Date variables are handled differently by every program, making them difficult to work with.
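To illustrate the problem, the same calendar day can arrive in several text formats, and each needs its own format string before the values can be compared. The three formats below are assumptions chosen for this sketch.

```python
from datetime import datetime

# The same day written three ways, each paired with its format string.
raw_dates = [
    ("03/04/2024", "%m/%d/%Y"),   # month/day/year
    ("04.03.2024", "%d.%m.%Y"),   # day.month.year
    ("2024-03-04", "%Y-%m-%d"),   # ISO 8601
]

# Parse every string into a date object so they share one representation.
parsed = [datetime.strptime(text, fmt).date() for text, fmt in raw_dates]
iso_dates = [d.isoformat() for d in parsed]
```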
39
What are conditional operators in programming?
Conditional operators are code snippets that allow for the creation of conditional logic.
40
List the four basic conditional operators.
* IF * AND * OR * NOT
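In Python these four operators are spelled `if`, `and`, `or`, and `not`. The function and its parameters below are hypothetical and only show how the operators combine.

```python
def should_train(rested, raining, has_treadmill):
    """Train when rested AND (it is NOT raining OR a treadmill is available)."""
    if rested and (not raining or has_treadmill):
        return True
    return False
```

For example, `should_train(True, True, False)` is `False` because it is raining and there is no treadmill.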
41
What does transposing data involve?
Transposing data involves changing the axis of the data, turning columns into rows and vice versa.
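A small table stored as a list of rows can be transposed with `zip`. The table contents are made up for this sketch.

```python
# Hypothetical table stored row by row.
table = [
    ["runner", "distance_km"],
    ["Ana", 5.0],
    ["Ben", 10.0],
]

# zip(*table) groups the first items of every row, then the second items,
# and so on, which turns columns into rows.
transposed = [list(column) for column in zip(*table)]
```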
42
What are system functions in data analytics?
System functions provide information about file paths and the local environment during data-wrangling.
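In Python, comparable information is available from the standard library. The `data/runs.csv` path is a hypothetical file name used only to show how a path is built and checked.

```python
import os
import platform
from pathlib import Path

# Where the script is running.
working_dir = Path.cwd()

# Build a path to a hypothetical input file relative to that directory.
data_file = working_dir / "data" / "runs.csv"

environment = {
    "working_dir": str(working_dir),
    "operating_system": platform.system(),
    "data_file_exists": data_file.exists(),
}
```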
43
What is the primary focus of derived variables?
Derived variables focus on summarizing data.
44
What is the importance of parsing data for NLP?
Parsing data is necessary for translating language into actionable data in NLP.
45
Fill in the blank: The process of breaking a sentence into words is known as _______.
tokenization
46
True or False: Reduction variables are used to increase the volume of data.
False
47
Fill in the blank: Recoding variables can help to translate quantitative variables into _______.
qualitative variables
48
What is the purpose of creating a SpeedCategory variable?
The SpeedCategory variable groups speeds based on average performance during a race.
49
The following picture represents what kind of join?
Inner join ## Footnote An inner join adds only the data that both datasets have in common to the new table.
50
Only using the Distinct Count of a dataset is an example of what?
Reduction ## Footnote Distinct Count is used to summarize data and reduces the amount of data to process.
51
The following is an example of what concept? Data = 'This is a sentence?' Data = ['This', 'is', 'a', 'sentence', '?']
Parsing ## Footnote Parsing involves breaking down large chunks of data into smaller, processable pieces.
52
The following is an example of what concept?
Dummy coding ## Footnote Dummy coding creates a new variable for every possible outcome of a categorical variable.
53
Which of the following is a logical operator? A. IF B. NOT C. OR D. All of the above are logical operators
All of the above are logical operators ## Footnote Common logical operators include IF, AND, OR, and NOT.
54
Fill in the blank: Distinct Count is an example of _______.
Reduction
55
Fill in the blank: Parsing is the concept of breaking down large chunks of data into _______.
smaller pieces
56
Fill in the blank: Dummy coding is a specific type of recoding that creates a new variable for every possible _______ of a categorical variable.
outcome
57
True or False: An inner join includes all data from both datasets.
False ## Footnote An inner join includes only data that both datasets have in common.
58
What is the main purpose of using Distinct Count in data analysis?
To summarize data and reduce processing load