Simplilearn Data Analyst Questions Flashcards

1
Q

What is data mining?

A

process of finding new, relevant information; it takes raw data and transforms it into valuable information

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is data profiling?

A

process of assessing a dataset for uniqueness, consistency, and logic; it usually doesn’t involving identifying incorrect/inaccurate data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is data wrangling?

A

the process of cleaning, structuring, and enriching raw data into a desired, usable format for better decision-making

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is a simple process of data wrangling?

A

discover –> structure –> clean –> enrich –> validate –> analyze

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Data Wrangling vs Data Cooking?

A

Data cooking involves falsifying data or selectively deleting data to improve a hypothesis

An example is demographic data being manipulated by fieldworkers, researchers, etc. to support behavioral science theory.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are common problems data analysts encounter during analysis?

A

1) handling duplicate/missing values
2) collecting meaningful, correct data at the right time
3) making data secure
4) dealing with compliance issues (ensuring that sensitive data is organized and meets enterprise business rules and legal/govt regulations)
5) handling data purging (freeing up database space) and storage issues

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are some steps in the analytics project?

A

1) state/understand the problem
2) collect data
3) clean data
4) explore and analyze the data
5) interpret results

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are some technical tools used for analysis and presentation?

A

MS SQL Server, MySQL, MS Excel, IBM SPSS, Tableau, Python, MS PowerPoint

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are some best practices for data cleaning?

A

Make a data cleaning plan by understanding where common errors happen and keep communications open;

Identify and remove duplicates before working with data;

Focus on accuracy, maintain value types of data, provide mandatory constraints, and set cross-field validation;

Standardize the data at the point of entry so that there’s less chaos and fewer errors occur

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How can you handle missing values in a dataset?

A

Listwise deletion: an entire record is excluded from analysis if any single value is missing

Average imputation: Use the average value of the responses from other participants to replace missing values

Regression substitution: use multiple-regression analysis to estimate a missing value

Multiple imputation: create plausible values based on correlations for missing data and then averages the simulated datasets by incorporating random errors in predictions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a normal distribution?

A

It is a type of continuous probability distribution that is symmetric about a mean and appears as a bell curve.

mean = median = mode and they are located at the center of the graph

68% of data lies within 1 std away from the mean/avg

95% data falls within 2 std away from mean/avg

99.7% data falls within 3 std away from mean/avg

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is time series analysis?

A

Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

In Tableau, what differs between joining and blending?

A

Data joining can only be done when data comes from the same source. So to combine two tables, the tables must be from the same databases or two/more sheets from the same Excel file.

Meanwhile, data blending is used when data is from 2/more different sources. An example would be combining an Oracle table with SQL Server or combining an Excel sheet and Oracle table.

In data joining, all combined tables/sheets contains a common set of dimensions/measures. On the other hand, data blending has each data source contain its own set of dimensions/measures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Overfitting vs Underfitting

A

Overfitting: A model trains the data too well using the training set, causing significant performance drops over the test/validation set. This happens when the model understands noise and random fluctuations too well and over-specifies.

Underfitting: A model is not able to train data well or generalize new data. It performs poorly on both training and testing data. This happens when there is less data to build an accurate model or if the model does not suit the data (e.g.: using a linear model on non-linear data).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

In MS Excel, a numeric value can be treated as a text value if it is preceded by an…

A

Apostrophe (‘)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

In Excel, what is the difference between COUNT, COUNTA, COUNTBLANK, and COUNTIF?

A

COUNT: returns the number of numeric cells in a range

COUNTA: returns the number of non-blank cells in a range

COUNTBLANK: returns the number of blank cells in a range

COUNTIF: returns the number of values by checking a given condition

17
Q

How does the function VLOOKUP work in Excel?

A

Can be used to find things in a table or a range by row.

VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

lookup_value: value you want to look up

table_array: range where lookup_value is located

col_index_num: column number in the range that contains the return value

range_lookup: specify TRUE if you want to approximate match or FALSE if you want an exact match of the return value

18
Q

In SQL, how do you subset or filter data?

A

Use WHERE and HAVING clauses

Ex1: select * from MOVIES where Director = ‘Brad Bird’;

Ex2: 
select Director, 
sum(Duration) as total_duration,
avg(Duration) as avg_duration
from Movies
group by Director
having avg(Duration)>151;
19
Q

In SQL what is the difference between WHERE and HAVING clauses?

A

the WHERE clause: works on row data, filter occurs before any groupings; aggregate functions cannot be used

SELECT col1, col2,…
FROM table_name
WHERE condition;

Meanwhile…

the HAVING clause: works on aggregated data; used to filter values from a group; allows aggregate functions

SELECT col1, col2, ...
FROM table_name
WHERE condition
GROUP BY column_names
HAVING condition
ORDER BY column_names
20
Q

How do you use the reshape() function in NumPy?

A

reshape(array, shape)

Ex1:
import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
np.reshape(a, (2,5))

21
Q

What are different ways to create a dataframe in Pandas?

A

import pandas as pd…

1) by initializing a list
data = [[‘tom’, 30], [‘jerry’, 20], [‘angela’, 35]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

2) by initializing a dictionary
data = {‘Name’:[‘Tom’, ‘Jerry’, ‘Angela’], ‘Age’: [20, 35, 19]}
df = pd.DataFrame(data)

22
Q

Suppose you have an “emp.csv” file. Using Python, how can you create an employees dataframe and display the head and summary?

A

import pandas as pd

employees = pd.read_csv(“File_path”)

employees. head()
employees. info()

23
Q

In Python, how do you select specific columns from a dataframe?

A

for one column: df[col_name]

for two or more columns: df[[col1, col2,…]]

24
Q

What criteria entails that a developed data model is good or bad?

A

Good models are…

intuitive, insightful, and self-explanatory

easily consumed by the clients for actionable and profitable results

easily adapted to changes according to business requirements

scale according to new data when data gets updated

25
Q

Why is Exploratory Data Analysis so important?

A

EDA helps understand data better.

Helps obtain confidence in data, giving reassurance to go ahead and engage in developing a ML algorithm

Allows you to refine your selection of feature variables for later model building

Allows you discover hidden trends and insights

26
Q

How do you treat outliers?

A

drop, cap your outliers, assign new values, or try a new transformation (as last resort)

27
Q

Descriptive vs predictive vs prescriptive (analytics)?

A

Descriptive: provides insights into past/history to answer “What has happened?”; uses data aggregation and mining techniques

Predictive: understands the futur and answers “What could happen?”; uses statistical models and forecasting techniques

Prescriptive: suggests various courses of action to answer “What should you do?”; uses optimization and simulation algorithms to advise possible outcomes

28
Q

What are different types of sampling techniques?

A

Sampling = statistical method that selects a subset of data from an entire dataset (population) to estimate the characteristics of the population

simple random sampling

systematic sampling

cluster sampling

stratified sampling

judgmental/purposive sampling

29
Q

What are the different types of hypothesis testing?

A

null hypothesis: states there’s no relation between the predictor and outcome variables in the population; denoted by H0

alternative hypothesis: states there’s some relation between the predictor and outcome variables in the population; denoted by H1