Module 3 - Preparing and Cleaning Data for Analysis Flashcards

1
Q

Selecting relevant data for your analysis includes determining what?

A

Determining the type(s) of data that you need and finding a source for that data.

2
Q

Explain this statement: “When selecting data for a project, it is important to focus on finding data that may provide insights into your original business question.”

A

For example, if you are seeking to understand demographic characteristics of people who bought Product X in the past year, you should only be using data that is directly related to Product X.

3
Q

What should you do if the data you need to answer your questions isn’t readily available?

A

It may be necessary to establish new procedures to collect the data required for your analysis. Other times, it may involve combining data from multiple sources into a format that can be analyzed.

4
Q

Explain what data could be gathered in this situation: “An entertainment producer is gathering data about the viability of a movie project.”

A

If the movie is an adaptation of a book, they need data on the sales of books by that author, within that genre, and across a variety of population demographics. They might compare the profitability of other movies with similar plots or characters, and their release dates, to determine the best time of the year to release a picture of that genre. Producers may also analyze data on the actors and locations that appear in the most successful recent movies to make casting and production decisions.

5
Q

Enumerate the questions that you should ask yourself when selecting a data source.

A

Some questions that you should ask yourself when selecting a data source:
a. What data points are necessary to inform your analysis?
b. Do you already have access to this data, or must you find a dataset from another source?
c. Where are reliable and verifiable sources of this data?
d. How often is the relevant data collected and updated?
e. How is the data licensed for use, and is there a cost?
f. Is the data in a format that you can use, or convert to use, with your tools?

6
Q

What are the two types of data that analysts work with?

A

Static data and streaming data.

7
Q

Data that is received and stored before any analysis is performed on it is considered what?

A

static data

8
Q

When each event is processed and analyzed as it is received and subsequent results are used or stored, the data is referred to as?

A

streaming data.
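
The difference between the two can be sketched in a few lines of Python. This is only an illustration: the function names and the sensor readings are hypothetical and do not come from the course material. The static version analyzes values that are already stored, while the streaming version updates its result as each event arrives.

```python
from typing import Iterable, Iterator


def batch_mean(stored_values: list[float]) -> float:
    """Static data: every value is already stored, so analyze it in one pass."""
    return sum(stored_values) / len(stored_values)


def running_mean(events: Iterable[float]) -> Iterator[float]:
    """Streaming data: update and emit the result as each event is received."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


readings = [20.0, 22.0, 24.0]  # hypothetical sensor readings

static_result = batch_mean(readings)
# iter(readings) stands in for a live feed that delivers one event at a time.
streaming_results = list(running_mean(iter(readings)))
```

In the streaming case an intermediate result is available after every event, which is what allows results to be used or stored as the data arrives.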

9
Q

_____ is a data type used to represent text or sequences of characters. It can include letters, numbers, symbols, and spaces. Strings are commonly used to store names, addresses, sentences, and any other textual information. In programming, strings are typically enclosed within single (‘’) or double (“”) quotation marks.

A

String

10
Q

_____ is a whole number without any decimal or fractional parts. It represents a count or quantity that can be positive, negative, or zero. Integers are used to store values like counts of items, ages, and identifiers. They are typically represented without a decimal point.

A

Integer

11
Q

_____ is a data type used to represent numbers with decimal points. Floats can represent a wide range of values, including both whole-number and fractional values. They are used in scientific calculations, measurements, and financial computations, although binary floats store most decimal fractions only approximately.

A

A floating-point number, often referred to as a “float”
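
A short Python sketch of how float precision behaves in practice. The variable names are illustrative; `Decimal` comes from Python's standard library and is shown only as one possible alternative when exact decimal arithmetic matters, not as something the course prescribes.

```python
from decimal import Decimal

# Binary floats cannot store 0.1 or 0.2 exactly, so the sum is slightly off.
float_sum = 0.1 + 0.2
float_is_exact = float_sum == 0.3

# Decimal keeps the decimal digits exactly as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_is_exact = decimal_sum == Decimal("0.3")

# Rounding to a fixed number of places is a common way to compare floats.
rounded_sum = round(float_sum, 2)
```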

12
Q

This data type is used to represent specific points in time. It can include information about the year, month, day, hour, minute, second, and sometimes even milliseconds. This data type is crucial for recording events, scheduling, and performing temporal calculations. Formats for representing date and time values can vary depending on the programming language and system.

A

The date and time datatype
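
Because date and time formats vary between systems, a common first step is parsing text into a proper date and time value. The sketch below uses Python's standard `datetime` module; the two input strings are made-up examples of the same moment written in different formats.

```python
from datetime import datetime, timedelta

raw_us = "03/04/2024 14:30:00"   # month/day/year, as some systems export it
raw_iso = "2024-03-04T14:30:00"  # ISO 8601 style

parsed_us = datetime.strptime(raw_us, "%m/%d/%Y %H:%M:%S")
parsed_iso = datetime.fromisoformat(raw_iso)

# Once parsed, both strings represent the same point in time.
same_moment = parsed_us == parsed_iso

# Temporal calculation: schedule a follow-up one week later.
one_week_later = parsed_iso + timedelta(days=7)
```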

13
Q

This data type represents binary values: either true or false, yes or no, on or off. Booleans are used to make logical comparisons and decisions in programming. They are often used in conditional statements and expressions to control the flow of a program. Booleans help determine the validity of conditions or statements.

A

The boolean data type
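
As a rough illustration, the sketch below converts yes/no style text into booleans and then uses them in a conditional to filter records. The helper `parse_bool`, the accepted spellings, and the sample records are all assumptions made for this example.

```python
def parse_bool(text: str) -> bool:
    """Map common true/false spellings to a Python bool (hypothetical helper)."""
    normalized = text.strip().lower()
    if normalized in {"true", "yes", "on", "1"}:
        return True
    if normalized in {"false", "no", "off", "0"}:
        return False
    raise ValueError(f"Not a recognizable boolean: {text!r}")


# Hypothetical survey rows: name (string), age (integer), subscribed (text flag).
rows = [("Ana", 34, "Yes"), ("Ben", 27, "no"), ("Cy", 41, "ON")]

# The boolean drives the condition that decides which rows are kept.
subscribers = [name for name, age, flag in rows if parse_bool(flag)]
```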

14
Q

_____ refers to data that is entered and maintained in defined fields within a file or record.

A

Structured data

15
Q

_____ is easily entered, classified, queried, and analyzed by a computer.

A

Structured data

16
Q

Is data found in relational databases and spreadsheets structured?

A

Yes, for example, when you submit your name, address, and billing information to a website, you are creating structured data. The database may force you to enter it in a certain format for a computer to interpret it easily.

17
Q

What are the characteristics of structured data?

A
  • There is a well-defined and organized structure.
  • It can be stored in tables, usually within vertical columns and horizontal rows.
  • The content and format of the data is documented.
  • It is organized into files, records, and fields.
  • It can be searched, sorted, and queried.
  • Input controls can reduce the possibility of invalid data.
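
Several of these characteristics can be seen in a small relational table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `customers` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A documented, well-defined structure: named columns with declared types.
# The CHECK constraint acts as an input control against invalid data.
conn.execute(
    """
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO customers (name, age) VALUES (?, ?)",
    [("Ana", 34), ("Ben", 27)],
)

# Structured data can be searched, sorted, and queried.
names_by_age = [
    row[0] for row in conn.execute("SELECT name FROM customers ORDER BY age")
]

# An invalid value is rejected before it reaches the table.
try:
    conn.execute("INSERT INTO customers (name, age) VALUES (?, ?)", ("Cy", -5))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```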
18
Q

_____ is a lightweight data-interchange format that is easy for humans to read and write.

A

JSON
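
A minimal sketch of what JSON looks like in practice, using Python's standard `json` module. The record itself is a made-up example.

```python
import json

# Hypothetical record with a string, an integer, and a boolean field.
record = {"name": "Ana", "age": 34, "subscribed": True}

# Serialize to JSON text for interchange, then read it back.
json_text = json.dumps(record)
restored = json.loads(json_text)
```

Here `json_text` is plain, human-readable text, and parsing it returns a dictionary equal to the original record.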

19
Q

_____ is a markup language that is similar to HTML.

A

XML
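
For comparison, a similar record written as XML can be read with Python's standard `xml.etree.ElementTree` module. The element and attribute names below are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer record marked up with tags, much like HTML.
xml_text = '<customer id="1"><name>Ana</name><age>34</age></customer>'

root = ET.fromstring(xml_text)
customer_id = root.get("id")       # attribute value
name = root.find("name").text      # text inside the <name> element
age = int(root.find("age").text)   # XML stores text, so convert explicitly
```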

20
Q

There are many different types of structured data files, which can be either created by humans or machine-generated. Give a few examples.

A
  • Relational Databases
  • Logs
  • Spreadsheets
  • Sensor readings
  • Transactional Records
21
Q

This data is raw and not organized in a predefined way. It has no fixed schema that identifies the type of data or its format, so there is no set way to enter, group, or analyze it.

A

Unstructured data

22
Q

Give some examples of unstructured data

A

Examples of unstructured data include the content of photos, audio, video, web pages, blogs, books, journals, white papers, PowerPoint presentations, articles, email, wikis, word processing documents, and text in general. Even the PDF version of this chapter is unstructured. The text is searchable, but it is not organized in a predefined format. Unstructured data can even be a traffic camera feed that is continuously sending images for processing.

23
Q

_____ is an automated process that uses a bot or web crawler to gather and copy specific data from the web to a database or spreadsheet. The data can then be easily analyzed.

A

Web scraping
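
The extraction half of web scraping can be sketched with Python's standard `html.parser` module. To keep the example self-contained it parses an inline HTML string rather than fetching a live page; the class name `TableScraper` and the product table are hypothetical.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._in_cell:
            self._current_row.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            self.rows.append(self._current_row)
            self._current_row = None


# Stand-in for a downloaded web page.
page = """
<table>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.50</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(page)
scraped_rows = scraper.rows  # ready to write to a spreadsheet or database
```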

24
Q

What are the two approaches to moving and processing data through a pipeline?

A

ETL and ELT

25
Q

Explain Extract, Transform and Load (ETL)

A

Extract, Transform and Load (ETL) is a process for collecting data from a variety of sources, transforming the data, and then loading the data into a database. One company’s data might be found in Word documents, spreadsheets, plain text, PowerPoints, emails and PDF files. Another company’s data may be housed in relational databases. This data can be stored in a variety of different formats, making it difficult to combine and analyze, so the transformation happens before loading.
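
The three ETL steps can be sketched end to end with Python's standard library. Everything here is illustrative: the two sources are inline strings standing in for a CSV file and a JSON export, and the `sales` table and field names are assumptions.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources that describe the same thing in different formats.
csv_source = "name,amount\nAna,10.50\nBen,7.25\n"
json_source = '[{"customer": "Cy", "total": "3.00"}]'


def extract():
    """Gather raw rows from each source."""
    csv_rows = list(csv.DictReader(io.StringIO(csv_source)))
    json_rows = json.loads(json_source)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Convert both sources into one format: (name, amount as float)."""
    unified = [(row["name"], float(row["amount"])) for row in csv_rows]
    unified += [(row["customer"], float(row["total"])) for row in json_rows]
    return unified


def load(rows, conn):
    """Load the already-transformed rows into the target database."""
    conn.execute("CREATE TABLE sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
total_sales = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Note that only clean, uniform rows ever reach the database, because the transformation runs before the load.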

26
Q

Explain Extract, Load, Transform (ELT) process

A

In an Extract, Load, Transform (ELT) process, the load and transform steps are reversed. ELT enables raw data to skip the transformation step and go straight to storage in an unstructured form. Transformation then occurs on the stored data as it is used. The ELT process is used primarily for large amounts of unstructured data.
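
A minimal ELT sketch, again using only Python's standard library. The raw events, the `raw_events` table, and the function name are hypothetical. The payloads are stored exactly as received, and the transformation runs only when the stored data is read for analysis.

```python
import json
import sqlite3

# Hypothetical raw events; note the inconsistent type of "ms".
raw_events = ['{"user": "ana", "ms": 1200}', '{"user": "ben", "ms": "950"}']

# Extract + Load: store the payloads untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?)", [(event,) for event in raw_events]
)


def seconds_by_user(connection):
    """Transform at read time: parse each payload and convert ms to seconds."""
    result = {}
    for (payload,) in connection.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        result[event["user"]] = int(event["ms"]) / 1000
    return result


durations = seconds_by_user(conn)
```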

27
Q

Explain Step 1: Extract

A

In this step, data is located and gathered from various sources in order to be converted into a single format for analysis. The data may be extracted from a relational database, a NoSQL database, flat files, XML files, or other formats.

28
Q

Explain Step 2: Transform

A

Data usually must be transformed before it can be loaded into a data warehouse for analysis. The transform step uses rules to convert the source data to the type of data needed for the target database. This includes converting any measured data to the same system of units (e.g., Imperial to Metric). The transform step also involves several additional tasks, such as joining data from several sources, aggregating, sorting, calculating new values from aggregated data, and then applying validation rules.
Data (possibly including some empty or erroneous records) may go through another part of the transform step known as ‘cleaning’ or ‘scrubbing’. Validation tells you whether the data needs cleaning. Some examples of data cleaning are removing blank records and standardizing formats such as date, time, and location. The cleaning part of the transform step further ensures the consistency of the source data.
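
A few of these transform and cleaning tasks can be sketched in Python. The records, field names, and accepted date formats below are assumptions for illustration: the code removes blank records, standardizes dates to one format, and converts inches (Imperial) to centimeters (Metric).

```python
from datetime import datetime

# Hypothetical source records with mixed date formats and one blank record.
raw_records = [
    {"date": "03/04/2024", "height_in": "70"},
    {"date": "2024-03-05", "height_in": "65"},
    {"date": "", "height_in": ""},
]

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # formats assumed to occur in the source


def standardize_date(text):
    """Return the date as YYYY-MM-DD, trying each known input format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def is_blank(record):
    """Validation rule: a record with no values at all needs to be removed."""
    return all(not value.strip() for value in record.values())


def transform(records):
    cleaned = []
    for record in records:
        if is_blank(record):
            continue  # cleaning: drop blank records
        cleaned.append(
            {
                "date": standardize_date(record["date"]),
                # 1 inch = 2.54 cm
                "height_cm": round(float(record["height_in"]) * 2.54, 1),
            }
        )
    return cleaned


cleaned_records = transform(raw_records)
```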

29
Q
A