Big Data Projects Flashcards

1
Q

Identify the steps in a data analysis project.

A

Conceptualization of the modeling task.
Data collection.
Data preparation and wrangling.
Data exploration.
Model training.

2
Q

Conceptualization of the modeling task.

A

define the problem at hand

3
Q

Data preparation and wrangling

A

cleaning the data set and preparing it for the model

4
Q

Data exploration.

A

feature selection and engineering and initial data analysis

5
Q

Model training

A

determining the appropriate ML algorithm to use, evaluating the algorithm using a training data set, and tuning the model.

6
Q

steps of preparing and wrangling data.

A

This critical step involves cleansing and organizing raw data for use in a model.

7
Q

Data cleansing

A

deals with reducing errors in the raw data

8
Q

Data wrangling

A

involves preprocessing data for model use

9
Q

Preprocessing includes

A

data transformation and scaling.

10
Q

scaling

A

Conversion of data features to a common unit of measurement

11
Q

Two common methods of scaling

A

normalization and standardization.

12
Q

Normalization scales

A

normalized X=(X−Xmin)/(Xmax−Xmin)
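
A minimal Python sketch of this formula, rescaling a list of values to the range [0, 1]. The function name `normalize` and the sample values are illustrative assumptions, not part of the original card.

```python
def normalize(values):
    """Min-max normalization: (x - min) / (max - min) for each x."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The denominator would be zero, so the formula is undefined.
        raise ValueError("all values are equal; normalization is undefined")
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical sample data, for illustration only.
print(normalize([10, 20, 30, 50]))  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, so a single extreme outlier compresses every other value into a narrow band.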

13
Q

Standardization Scales

A

standardized Xi=(Xi−μ)/σ
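
A minimal Python sketch of standardization, (Xi − μ) / σ, where μ is the mean and σ the standard deviation of the feature. The card does not say whether σ is the population or the sample standard deviation; this sketch assumes the population version (`statistics.pstdev`). The function name and data are illustrative.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center each value on the mean and scale by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("zero standard deviation; standardization is undefined")
    return [(x - mu) / sigma for x in values]


# Hypothetical sample data with mean 5 and population standard deviation 2.
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Unlike normalization, the result is not bounded to [0, 1]; it has mean 0 and standard deviation 1.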

14
Q

Text processing

A

cleansing and preprocessing of text-based data.

15
Q

Text cleansing involves the following steps:

A

Remove HTML tags.
Remove punctuation.
Remove numbers.
Remove extra white spaces.
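
The four cleansing steps can be sketched with regular expressions from Python's standard library. The function name `cleanse_text`, the specific patterns, and the sample string are assumptions chosen for illustration; real pipelines often keep or annotate some punctuation and numbers rather than dropping them outright.

```python
import re


def cleanse_text(raw):
    """Apply the four cleansing steps in order and return the cleaned string."""
    text = re.sub(r"<[^>]+>", " ", raw)        # 1. remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)       # 2. remove punctuation
    text = re.sub(r"\d+", " ", text)           # 3. remove numbers
    text = re.sub(r"\s+", " ", text).strip()   # 4. collapse extra white space
    return text


# Hypothetical input, for illustration only.
print(cleanse_text("<p>Revenue rose 12%,  beating   estimates!</p>"))
# Revenue rose beating estimates
```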

16
Q

Cleansed text is then normalized using the following steps:

A

Lowercasing.
Removal of stop words
Stemming.
Lemmatization.
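
A toy Python sketch of the first three normalization steps. The stop-word list and the suffix-stripping stemmer are deliberately tiny, hypothetical stand-ins for what an NLP library would provide. Lemmatization is not shown because it needs a dictionary of word forms.

```python
# Hypothetical, deliberately small stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to"}
# Hypothetical suffixes for a naive stemmer; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_text(text):
    tokens = text.lower().split()                        # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # stemming


print(normalize_text("The analysts are revising earnings forecasts"))
# ['analyst', 'revis', 'earning', 'forecast']
```

The stem "revis" is not a dictionary word, which is typical of stemming; a lemmatizer would return "revise" instead.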

17
Q

Stemming

A

converts all variations of a word into a common value

18
Q

Lemmatization

A

conversion of inflected forms of a word into its lemma

19
Q

tokenization

A

is the process of splitting a sentence into tokens
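
A minimal Python sketch of tokenization. It assumes a token is a run of letters, digits, or apostrophes; that definition and the function name `tokenize` are illustrative choices, not something the card specifies.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens, dropping surrounding punctuation."""
    # Assumption: a token is a maximal run of letters, digits, or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)


print(tokenize("The market's reaction was muted."))
# ['The', "market's", 'reaction', 'was', 'muted']
```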

20
Q

Data exploration

A

seeks to evaluate the data set and determine the most appropriate way to configure it for model training

21
Q

Steps in data exploration include the following:

A

Exploratory data analysis (EDA)
Feature selection.
Feature engineering

22
Q

Exploratory data analysis (EDA)

A

involves looking at data descriptors such as summary statistics, heat maps, and word clouds

23
Q

Feature selection

A

is a process to select only the needed attributes of the data for ML model training.

24
Q

Feature engineering

A

is the process of creating new features by transforming existing features

25
Q

Data Exploration for Structured Data

A

With EDA, structured data is organized in rows (observations) and columns (features).

With feature selection, we try to select only the features that contribute to the out-of-sample predictive power of the model.

Feature Engineering (FE) involves optimizing and improving the selected features.

26
Q

Model fitting errors can be caused by:

A

Size of the training sample.
Number of features.

27
Q

The three tasks of model training are as follows:

A

1 - Method selection.
2 - Performance evaluation.
3 - Tuning.

28
Q

Techniques to Measure Model Performance

A

1 - Error analysis.
2 - Receiver operating characteristic (ROC).
3 - Root mean square error (RMSE).

29
Q

Error analysis.

A

Errors in classification problems can be false positives (type I error) or false negatives (type II error).

30
Q

precision (P)

A

= TP / (TP + FP)

31
Q

recall (R)

A

= TP / (TP + FN)

32
Q

accuracy

A

= (TP + TN) / (TP + TN + FP + FN)

33
Q

F1 score

A

= (2 × P × R) / (P + R)
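
The four error-analysis metrics (precision, recall, accuracy, F1 score) can be computed together from the confusion-matrix counts. This is a minimal Python sketch; the function name and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.800
# recall: 0.667
# accuracy: 0.700
# f1: 0.727
```

The F1 score is the harmonic mean of precision and recall, so it always lies between the two.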

34
Q

Receiver operating characteristic (ROC)

A

Also used for classification problems, the ROC is a curve that plots the tradeoff between the false positive rate (FPR, x-axis) and the true positive rate (TPR, y-axis).

35
Q

TPR

A

= TP / (TP + FN)

36
Q

FPR

A

= FP / (FP + TN)
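
An ROC curve is traced by computing one (FPR, TPR) pair per classification threshold. This is a minimal Python sketch; the function name, labels, scores, and thresholds are hypothetical, and it assumes both classes appear in the labels so that neither denominator is zero.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) point per threshold; score >= threshold means 'positive'."""
    points = []
    for t in thresholds:
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if s >= t and y == 1)
        fp = sum(1 for y, s in pairs if s >= t and y == 0)
        fn = sum(1 for y, s in pairs if s < t and y == 1)
        tn = sum(1 for y, s in pairs if s < t and y == 0)
        # Assumption: labels contain at least one positive and one negative.
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
for fpr, tpr in roc_points(labels, scores, [0.85, 0.5, 0.2]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
# FPR=0.00  TPR=0.33
# FPR=0.33  TPR=0.67
# FPR=0.67  TPR=1.00
```

Lowering the threshold raises both rates, which is the tradeoff the curve displays.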

37
Q

Root mean square error (RMSE)

A

RMSE = √[ Σ(predicted − actual)² / n ]
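
A minimal Python sketch of RMSE: square each prediction error, average the squares over the n observations, and take the square root. The function name and sample values are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: every prediction is off by exactly 2.
print(rmse([3, 5, 7, 9], [1, 7, 5, 11]))  # 2.0
```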

38
Q

Parameters

A

are estimated by the model (e.g., slope coefficients in a regression model) using an optimization technique on the training sample.

39
Q

Hyperparameters

A

are specified by ML engineers, and are not dependent on the training sample.
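
The distinction can be sketched with a one-variable regression fitted by gradient descent. The slope `b` is a parameter estimated from the training sample, while `learning_rate` and `n_iterations` are hyperparameters chosen beforehand. All names and values are hypothetical, and gradient descent is just one possible optimization technique.

```python
def fit_slope(xs, ys, learning_rate=0.01, n_iterations=1000):
    """Fit y = b * x by gradient descent on mean squared error."""
    b = 0.0  # parameter: estimated from the training sample
    n = len(xs)
    for _ in range(n_iterations):  # hyperparameter: number of update steps
        # Derivative of mean((b*x - y)^2) with respect to b.
        gradient = sum(2 * (b * x - y) * x for x, y in zip(xs, ys)) / n
        b -= learning_rate * gradient  # hyperparameter: step size
    return b


# Hypothetical training sample generated by y = 2x.
slope = fit_slope([1, 2, 3, 4], [2, 4, 6, 8])
print(round(slope, 4))  # 2.0
```

Changing the training sample changes the estimated `b`; changing `learning_rate` or `n_iterations` is a tuning decision made by the engineer.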

40
Q

Ceiling analysis

A

Ceiling analysis is an evaluation and tuning of each of the components in the entire model-building pipeline.