Big Data Projects Flashcards
Steps in Big Data Analysis/Projects: Traditional with structured data.
**Conceptualize the task** -> Collect data -> Data preparation & processing -> Data exploration -> Model training.
Steps in Big Data Analysis/Projects: Textual Big Data.
Text problem formulation -> Data curation -> **Text preparation and processing** -> Text exploration -> Classifier output.
Preparation in structured data: Extraction
Creating a new variable from an existing one to ease the analysis.
Example: Date of birth -> Age
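A minimal pandas sketch of this extraction, assuming a hypothetical `dob` column:

```python
import pandas as pd

# Hypothetical data frame with a date-of-birth column
df = pd.DataFrame({"dob": pd.to_datetime(["1990-05-01", "1985-11-20"])})

# Extraction: derive a new Age variable from the existing DOB variable
df["age"] = (pd.Timestamp.today() - df["dob"]).dt.days // 365
```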
Preparation in structured data: Aggregation
Two or more variables aggregated into one single variable.
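A minimal sketch, assuming hypothetical `salary` and `bonus` columns that are combined into one income variable:

```python
import pandas as pd

df = pd.DataFrame({"salary": [50_000, 60_000], "bonus": [5_000, 8_000]})

# Aggregation: combine two related variables into a single variable
df["total_income"] = df["salary"] + df["bonus"]
```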
Preparation in structured data: Filtration
Eliminating data rows that are not needed.
[We filter out the rows that are not relevant]
Example: CFA Level 2 candidates only.
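A minimal sketch of filtration, assuming a hypothetical `cfa_level` column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "cfa_level": [1, 2, 2]})

# Filtration: keep only the rows that are relevant (Level 2 candidates)
level2_only = df[df["cfa_level"] == 2]
```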
Preparation in structured data: Selection
Eliminating data columns (features) that are not needed for the analysis.
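A minimal sketch of selection, assuming a hypothetical `internal_id` column that adds no analytical value:

```python
import pandas as pd

df = pd.DataFrame({"internal_id": [101, 102], "age": [34, 29], "income": [50_000, 72_000]})

# Selection: drop columns (features) that are not needed for the analysis
df = df.drop(columns=["internal_id"])
```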
Preparation in structured data: Conversion
Converting data into the appropriate types for analysis: nominal, ordinal, integer, ratio, or categorical.
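A minimal sketch of conversion, assuming hypothetical columns loaded as plain strings:

```python
import pandas as pd

df = pd.DataFrame({"income": ["50000", "72000"], "rating": ["AA", "BBB"]})

# Conversion: cast each column to an appropriate data type
df["income"] = df["income"].astype(int)          # integer/ratio data
df["rating"] = df["rating"].astype("category")   # nominal (categorical) data
```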
Cleansing structured data: Incomplete
Missing entries
Cleansing structured data: Invalid
Outside a meaningful range
Cleansing structured data: Inconsistent
Some data conflicts with other data.
Cleansing structured data: Inaccurate
Not a true value
Cleansing structured data: Non-uniform
Data not in an identical format.
Example: American date format (M/D/Y) vs. European (D/M/Y)
Cleansing structured data: Duplication
Multiple identical observations
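A minimal cleansing sketch covering two of the issues above (incomplete and duplicated rows), using a hypothetical data frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 29, 29], "income": [50_000, 61_000, 72_000, 72_000]})

# Incomplete: drop (or impute) rows with missing entries
df = df.dropna()

# Duplication: remove identical observations
df = df.drop_duplicates()
```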
Adjusting the range of a feature: Normalization
Rescales values into the range 0-1.
Sensitive to outliers.
(Xi - Xmin) / (Xmax - Xmin)
The denominator (Xmax - Xmin) is the range of the feature.
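A quick NumPy sketch of min-max normalization with made-up values:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# Normalization: rescale into the 0-1 range using (Xi - Xmin) / (Xmax - Xmin)
x_norm = (x - x.min()) / (x.max() - x.min())
```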
Adjusting the range of a feature: Standardization
Centers and rescales the values.
Requires a normal distribution.
(Xi - μ) / σ, where μ is the mean and σ is the standard deviation.
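The same made-up values, standardized:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# Standardization: center on the mean and rescale by the standard deviation
x_std = (x - x.mean()) / x.std()
```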
Performance evaluation: Precision formula
P = TP / (TP + FP)
Remember: the denominator contains all predicted positives.
Useful when the cost of a Type I error (FP) is high.
Precision is the ratio of correctly predicted positive classes to all predicted positive classes.
Precision is useful in situations where the cost of FP or a Type I error is high. For example, when an expensive product fails quality inspection (predicted class 1) and is scrapped, but it is actually perfectly good (actual class 0).
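A quick sanity check in Python with hypothetical confusion-matrix counts:

```python
# Precision = TP / (TP + FP), using made-up counts
tp, fp = 40, 10
precision = tp / (tp + fp)   # 0.8
```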
Performance evaluation: Recall formula
R = TP / (TP + FN)
Remember: Recall has the opposite error (FN) in the denominator.
Also known as sensitivity; useful when the cost of a Type II error (FN) is high.
Recall is the ratio of correctly predicted positive classes to all actual positive classes. Recall is useful in situations where the cost of FN or a Type II error is high. For example, when an expensive product passes quality inspection (predicted class 0) and is sent to the valued customer, but it is actually quite defective (actual class 1).
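The same made-up counts, extended with false negatives:

```python
# Recall (sensitivity) = TP / (TP + FN), using made-up counts
tp, fn = 40, 20
recall = tp / (tp + fn)   # about 0.67
```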
Performance evaluation: Accuracy formula
Accuracy = (TP + TN) / (TP + FN + TN + FP)
Accuracy is the percentage of correctly predicted classes out of total predictions.
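With all four hypothetical counts:

```python
# Accuracy = (TP + TN) / (TP + TN + FP + FN), using made-up counts
tp, tn, fp, fn = 40, 30, 10, 20
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.7
```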
Receiver operating characteristic: False Positive Rate formula
FP / (FP + TN)
Statement / (Statement + Opposite)
Receiver operating characteristic: True Positive Rate formula
TP / (TP + FN)
Statement / (Statement + Opposite)
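One ROC point computed from the same hypothetical counts:

```python
# TPR and FPR together give one point on the ROC curve, using made-up counts
tp, fn, fp, tn = 40, 20, 10, 30
tpr = tp / (tp + fn)   # about 0.67
fpr = fp / (fp + tn)   # 0.25
```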
In big data projects, which measure is the most appropriate for the regression method?
RMSE
(Root Mean Square Error)
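A minimal RMSE sketch with made-up predictions:

```python
import numpy as np

# RMSE: square root of the mean squared prediction error
actual = np.array([1.0, 2.0, 3.0])
predicted = np.array([1.1, 1.9, 3.2])
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
```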
What is “trimming” in big data projects?
Removing the bottom and top 1% of observations on a feature in a data set.
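A minimal trimming sketch on simulated data:

```python
import numpy as np

x = np.random.normal(size=1_000)

# Trimming: drop observations below the 1st and above the 99th percentile
lo, hi = np.percentile(x, [1, 99])
x_trimmed = x[(x >= lo) & (x <= hi)]
```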
What is “Winsorization” in big data projects?
Replacing the extreme values in a data set with the same maximum or minimum value (e.g., capping outliers at specified percentile values).
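A minimal winsorization sketch on the same simulated data:

```python
import numpy as np

x = np.random.normal(size=1_000)

# Winsorization: cap extreme values at the 1st and 99th percentile values
lo, hi = np.percentile(x, [1, 99])
x_winsorized = np.clip(x, lo, hi)
```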
Confusion Matrix: F1 Score Formula
(2 x P x R) / (P + R)
The F1 score is the harmonic mean of precision and recall.
The F1 score is more appropriate than accuracy when there is an unequal class distribution in the dataset and it is necessary to measure the balance between precision and recall.
High scores on both of these metrics suggest good model performance.
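Continuing the hypothetical numbers from the precision and recall cards:

```python
# F1 = (2 x P x R) / (P + R): harmonic mean of precision and recall
precision, recall = 0.8, 0.67
f1 = (2 * precision * recall) / (precision + recall)   # about 0.73
```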