2B. Clean, transform, and load data in Power BI Flashcards

1
Q

What are some advantages of cleaning data?

A
  • Measures and columns produce more accurate results when they perform aggregations and calculations.
  • Tables are organized so that users can find the data intuitively.
  • Duplicates are removed, making data navigation simpler. Removing them also produces columns that can be used in slicers and filters.
  • A complicated column can be split into two simpler columns. Multiple columns can be combined into one column for readability.
  • Codes and integers can be replaced with human readable values.
2
Q

What are some basic transformations you can do when you begin cleaning your data?

A
  • Identify column headers and names
  • Promote headers
  • Rename columns
  • Remove top rows
  • Remove columns
  • Unpivot columns
  • Pivot columns
  • Rename queries
  • Replace values (e.g. spelling mistakes)
  • Replace null values (only if nulls should be counted as zero, for example)
  • Remove duplicates
  • Change column data types
  • Combine multiple tables into a single table
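
Several of the steps above can be sketched outside Power Query to show what they do to the data. The following is a minimal illustration in plain Python, not Power Query's own API; the sample rows, the column names `product` and `qty`, and the `clean` helper are all hypothetical.

```python
# Raw rows as they might arrive from a flat file: the first row holds the headers.
raw = [
    ["product", "qty"],
    ["Widgit", "3"],    # spelling mistake
    ["Widget", "3"],
    ["Gadget", None],   # missing quantity
    ["Gadget", None],   # exact duplicate of the row above
]

def clean(rows):
    header, *body = rows                    # promote the first row to headers
    cleaned, seen = [], set()
    for row in body:
        rec = dict(zip(header, row))
        if rec["product"] == "Widgit":      # replace a misspelled value
            rec["product"] = "Widget"
        if rec["qty"] is None:              # replace null with 0 (assumes null means zero here)
            rec["qty"] = "0"
        rec["qty"] = int(rec["qty"])        # change the column type from text to integer
        key = tuple(rec.items())
        if key not in seen:                 # remove duplicate rows
            seen.add(key)
            cleaned.append(rec)
    return cleaned

result = clean(raw)
```

After the spelling fix, the first two rows become identical, so the duplicate-removal step leaves one `Widget` row and one `Gadget` row.
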
3
Q

What principle should guide you when thinking about which columns to keep and which to remove?

A

Examine each column and ask yourself if you really need the data that it contains. If you don’t plan on using that data in a report, the column adds no value to your semantic model. Therefore, the column should be removed. You can always add the column later, if your requirements change over time.

4
Q

What is a benefit to unpivoting columns (i.e. a benefit of tall tables)?

A

It streamlines the process of creating DAX measures on the data later. Unpivoting also gives you a simpler way of slicing the data.
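
To see why a tall table is easier to aggregate and slice, here is a rough sketch of the unpivot idea in plain Python rather than in Power Query. The month columns, the `unpivot` helper, and the output names `month` and `sales` are assumptions made for illustration.

```python
# A wide table: one column per month.
wide = [
    {"product": "Widget", "Jan": 10, "Feb": 12},
    {"product": "Gadget", "Jan": 7, "Feb": 9},
]

def unpivot(rows, id_column, attribute_name="month", value_name="sales"):
    """Turn every non-id column into an (attribute, value) row."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                tall.append({
                    id_column: row[id_column],
                    attribute_name: column,
                    value_name: value,
                })
    return tall

tall = unpivot(wide, "product")

# With a tall table, one expression covers any month; no per-column logic is needed.
feb_total = sum(r["sales"] for r in tall if r["month"] == "Feb")
```

In the tall shape, `month` is an ordinary column, so it can be filtered or grouped like any other attribute.
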

5
Q

What is a benefit to pivoting columns?

A

You can use the Pivot Column feature to convert your flat data into a table that contains an aggregate value for each unique value in a column. For example, you might want to use this feature to summarize data by using different math functions such as Count, Minimum, Maximum, Median, Average, or Sum.

In the Pivot Column window, select a column from the Values Column list, such as Subcategory name. Expand the advanced options, select an option from the Aggregate Value Function list, such as Count (All), and then select OK.
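
As a loose analogy for a pivot with the Count aggregate, the sketch below counts rows per unique value in plain Python. It is not how Power Query implements the feature; the sample categories and the `pivot_count` helper are hypothetical.

```python
from collections import Counter

# Flat data: one row per subcategory.
flat = [
    {"category": "Bikes", "subcategory": "Road"},
    {"category": "Bikes", "subcategory": "Mountain"},
    {"category": "Clothing", "subcategory": "Gloves"},
]

def pivot_count(rows, pivot_column):
    """Return one entry per unique value in pivot_column, holding its row count."""
    return dict(Counter(row[pivot_column] for row in rows))

pivoted = pivot_count(flat, "category")
```

Each unique category becomes its own entry, which corresponds to one output column in the pivoted table.
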

6
Q

What are some best practices for naming tables, columns, and values?

A

Naming conventions for tables, columns, and values have no fixed rules; however, we recommend that you use the language and abbreviations that are commonly used within your organization and that everyone accepts as common terminology.

A best practice is to give your tables, columns, and measures descriptive business terms and replace underscores (“_”) with spaces. Be consistent with abbreviations, prefixes, and words like “number” and “ID.” Excessively short abbreviations can cause confusion if they are not commonly used within the organization.

Also, by removing prefixes or suffixes that you might use in table names and instead naming them in a simple format, you will help avoid confusion.

When replacing values, try to imagine how those values will appear on the report. Values that are too long might be difficult to read and fit on a visual. Values that are too short might be difficult to interpret. Avoiding acronyms in values is also a good idea, provided that the text will fit on the visual.

7
Q

What is the default number of rows that Power Query loads into preview (and therefore bases column analyses on)?

A

1,000 rows

8
Q

What files are more likely to contain data type errors?

A

You have a higher chance of getting data type errors when you’re dealing with flat files, such as comma-separated values (.CSV) files and Excel workbooks (.XLSX), because the data is often entered manually into worksheets, where mistakes are easy to make. Conversely, in databases, the data types are predefined when tables or views are created.

9
Q

What are some implications of incorrect data types?

A

They will prevent you from creating certain calculations, deriving hierarchies, or creating proper relationships with other tables.
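
A small plain-Python sketch can show how a wrong type distorts calculations. It assumes a numeric column that was loaded as text; the values are made up for illustration.

```python
# Numbers that were loaded as text.
amounts_as_text = ["9", "10", "100"]

# Text is compared character by character, so ordering and max are misleading.
text_sorted = sorted(amounts_as_text)   # ['10', '100', '9']
text_max = max(amounts_as_text)         # '9'

# After converting to a numeric type, the same operations behave as expected.
amounts = [int(v) for v in amounts_as_text]
numeric_max = max(amounts)              # 100
total = sum(amounts)                    # 119
```
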

10
Q

What are some reasons to combine tables into a single table?

A
  • Too many tables exist, making it difficult to navigate an overly complicated semantic model.
  • Several tables have a similar role.
  • A table has only a column or two that can fit into a different table.
  • You want to use several columns from different tables in a custom column.
11
Q

What are the two ways of combining tables?

A
  • Merging (adding columns from one table (or query) into another).
  • Appending (adding rows of data to another table or query).
12
Q

What is required when appending tables?

A

The columns that you need in the combined table must have the same names in each of the original tables; otherwise, they will not line up into one consolidated view.
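
The effect of mismatched column names can be sketched in plain Python. This is an illustration only: the `append` helper is hypothetical, and it assumes that columns without a match are kept and filled with nulls.

```python
sales_2023 = [{"product": "Widget", "amount": 10}]
sales_2024 = [{"Product": "Gadget", "amount": 7}]   # header differs only in case

def append(*tables):
    """Stack rows from all tables, using the union of their column names."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]

# Names do not match: the result has two separate product columns with gaps.
mismatched = append(sales_2023, sales_2024)

# Renaming the column first gives one consolidated view.
renamed = [{"product": r["Product"], "amount": r["amount"]} for r in sales_2024]
combined = append(sales_2023, renamed)
```
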

13
Q

What is required when merging tables?

A

You must have a column that is the key between the two tables. This process is similar to the JOIN clause in SQL.

14
Q

What are the three main join options when merging tables?

A
  • Left Outer - Displays all rows from the first table and only the matching rows from the second.
  • Full Outer - Displays all rows from both tables.
  • Inner - Displays the matched rows between the two tables.
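
The three join kinds above can be compared with a small plain-Python sketch. The `merge` helper, the kind names, and the sample tables are hypothetical and only mirror the behavior described in the list; they are not Power Query's API.

```python
orders = [
    {"customer_id": 1, "order": "A"},
    {"customer_id": 2, "order": "B"},
]
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 3, "name": "Raj"},
]

def merge(left, right, key, kind):
    """Join two lists of dicts on `key`; kind is 'inner', 'left_outer', or 'full_outer'."""
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    left_keys = {row[key] for row in left}
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        for r in matches:
            out.append({**l, **r})                      # matched rows appear in every kind
        if not matches and kind in ("left_outer", "full_outer"):
            out.append({**l, **{c: None for c in right_cols}})
    if kind == "full_outer":
        for r in right:
            if r[key] not in left_keys:                 # right rows with no match on the left
                out.append({**{c: None for c in left_cols}, **r})
    return out

inner = merge(orders, customers, "customer_id", "inner")
left_outer = merge(orders, customers, "customer_id", "left_outer")
full_outer = merge(orders, customers, "customer_id", "full_outer")
```

With this sample data, the inner join returns one row, the left outer join returns two, and the full outer join returns three.
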
15
Q

What is profiling data about, and why is it important?

A

It is about studying the nuances of the data: determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, and so on.

Profiling is important because it lets you shape and organize the data so that interacting with it and identifying its distribution are straightforward, which makes building report elements on the front end much easier.

16
Q

What are the different statistics you can rely on in Power Query preview to profile data?

A
  • Column quality
  • Column distribution
  • Column profile
  • Column statistics
17
Q

What does column quality in the Power Query editor tell you?

A

It shows you the percentages of data that are valid, in error, and empty. In an ideal situation, you want 100 percent of the data to be valid.
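
The three percentages can be approximated with a short plain-Python sketch. The `column_quality` helper and the validity rule (digits only) are assumptions for illustration, not Power Query's actual logic.

```python
def column_quality(values, is_valid):
    """Return the percentage of valid, error, and empty values in a column."""
    total = len(values)
    empty = sum(1 for v in values if v is None or v == "")
    error = sum(1 for v in values if v not in (None, "") and not is_valid(v))
    valid = total - empty - error
    return {
        "valid": 100 * valid / total,
        "error": 100 * error / total,
        "empty": 100 * empty / total,
    }

# A column that should contain whole numbers.
quality = column_quality(["12", "7", "", "abc"], str.isdigit)
```
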

18
Q

What does column distribution in the Power Query editor tell you?

A

It shows you the distribution of the data within the column, along with the counts of distinct values (each different value counted once) and unique values (values that appear exactly once).
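
As a rough illustration of how the distinct and unique counts differ, the plain-Python sketch below computes both for a made-up column of colors.

```python
from collections import Counter

values = ["red", "blue", "red", "green"]
counts = Counter(values)            # occurrences of each value

# Distinct: how many different values exist (red, blue, green).
distinct = len(counts)

# Unique: how many values occur exactly once (blue, green).
unique = sum(1 for n in counts.values() if n == 1)
```
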

19
Q

What does column profile in the Power Query editor tell you?

A

It gives you a more in-depth look at the statistics for a column, based on the first 1,000 rows of data. The statistics include the count of rows, which is important when verifying whether your data was imported successfully. For example, if your original database had 100 rows, you could use this row count to verify that 100 rows were, in fact, imported correctly.

The statistics also show how many rows Power BI has deemed to be outliers, how many are empty rows or empty strings, and the min and max, which are the smallest and largest values in the column, respectively. The min and max are particularly important for numeric data because they immediately reveal a maximum value that is beyond what your business identifies as a “maximum,” so you know where to focus your efforts when delving deeper into the data. For a text column, the minimum value is the first value and the maximum value is the last value in alphabetical order.
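
A few of these statistics can be sketched in plain Python to show what they report. The `column_statistics` helper and the sample columns are hypothetical, and outlier detection is left out because the source does not describe how it is calculated.

```python
def column_statistics(values):
    """Return row count, empty count, min, and max for one column."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "count": len(values),
        "empty": len(values) - len(present),
        "min": min(present),
        "max": max(present),
    }

# Numeric column: an unexpectedly large max stands out immediately.
numeric_stats = column_statistics([120, 45, None, 9800])

# Text column: min and max follow alphabetical order
# (Python compares by character code, so mixed case would sort differently).
text_stats = column_statistics(["Pear", "Apple", "Mango"])
```
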

20
Q

What does the value distribution graph in the Power Query editor tell you?

A

It tells you the counts for each distinct value in that specific column.

21
Q

What language does Power Query use?

A

M code. Even when you use the graphical interface to transform and clean data, all steps are recorded as M code “behind the scenes.” Each step can always be modified directly in M code.

22
Q

How do I see the “complete” M code for one query in Power Query editor?

A

Select Advanced Editor in the Power Query editor.

23
Q

What should I keep in mind when editing M code in Advanced Editor in Power Query?

A

M code is written top-down. Later steps in the process can refer to previous steps by the variable name to the left of the equal sign. Be careful about reordering these steps because it could ruin the statement dependencies.