Exploratory Data Analysis Flashcards

1
Q

How do you use Pandas and NumPy together?

A

Use Pandas to explore and manipulate the data, then load it into NumPy arrays to feed into an ML algorithm

2
Q

Data types

A

Numerical
Categorical
Ordinal

3
Q

Numerical Data types

A

Discrete (integer)
- e.g. head count

Continuous
- infinite possible precision
- e.g. how much rain fell on a given day?

4
Q

Categorical Data

A

Gender
Political Parties

Order doesn’t matter
The values have no intrinsic numerical meaning

5
Q

Ordinal Data

A

A mixture of categorical and numerical data: categories that have a meaningful order

  • Movie Ratings (4 Star movie)
6
Q

Normal Distribution vs Probability Mass Function

A

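Continuous vs discrete data: a normal distribution is described by a probability density function over continuous values, while a probability mass function assigns probabilities to discrete values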

7
Q

Poisson Distribution

A

Works with discrete data: it models the count of events in a fixed interval
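
As a rough illustration of the Poisson distribution on discrete counts, the sketch below evaluates its probability mass function by hand. The rate `lam = 3.0` and the helper name `poisson_pmf` are assumptions chosen for this example only.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events) when events occur at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0  # hypothetical average number of events per interval

# Probabilities are defined only for whole-number counts k = 0, 1, 2, ...
probs = [poisson_pmf(k, lam) for k in range(50)]
total = sum(probs)  # very close to 1, since the tail beyond k = 49 is negligible
```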

8
Q

Binomial Distribution

A

Discrete data with boolean (yes/no) outcomes

A binomial distribution models the number of successes in a fixed number of independent binary trials, such as coin flips.

9
Q

Bernoulli Distribution

A
Special case of the binomial distribution
with a single trial (n=1)

A binomial random variable is the sum of n independent Bernoulli trials with the same success probability
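
The relationship between the two distributions can be checked numerically. This is a minimal sketch, assuming a success probability `p = 0.3`; the function names are hypothetical and not from any library.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, each succeeding with probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def bernoulli_pmf(k: int, p: float) -> float:
    """P(outcome k) for a single trial, where k is 1 (success) or 0 (failure)."""
    return p if k == 1 else 1 - p

p = 0.3  # assumed success probability

# With n = 1 the binomial distribution reduces to the Bernoulli distribution.
same = all(abs(binomial_pmf(k, 1, p) - bernoulli_pmf(k, p)) < 1e-12 for k in (0, 1))

# Probability of exactly 2 heads in 4 fair coin flips: 6 * 0.5**4 = 0.375
two_heads = binomial_pmf(2, 4, 0.5)
```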

10
Q

Additive model

A

Seasonality + Trend + Noise

Use it when the seasonal variation is constant

11
Q

Multiplicative model

A

Seasonality × Trend × Noise

Use it when the seasonal variation increases as the trend increases
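
To see the difference between the additive and multiplicative models, the sketch below builds two synthetic monthly series and compares their seasonal swing early and late in the series. The trend, amplitudes, and function names are all made up for illustration, and noise is left out to keep the comparison exact.

```python
import math

PERIOD = 12  # hypothetical monthly data with a yearly cycle

def trend(t: int) -> float:
    return 100.0 + 2.0 * t

def season(t: int) -> float:
    return math.sin(2 * math.pi * t / PERIOD)

def additive(t: int) -> float:
    # Seasonal term is added: its size does not depend on the trend.
    return trend(t) + 10.0 * season(t)

def multiplicative(t: int) -> float:
    # Seasonal term is a factor: its size scales with the trend.
    return trend(t) * (1.0 + 0.2 * season(t))

def seasonal_swing(model, months) -> float:
    """Peak-to-trough size of the seasonal component after removing the trend."""
    detrended = [model(t) - trend(t) for t in months]
    return max(detrended) - min(detrended)

first_year, fifth_year = range(0, 12), range(48, 60)
add_first = seasonal_swing(additive, first_year)
add_fifth = seasonal_swing(additive, fifth_year)              # same as add_first
mul_first = seasonal_swing(multiplicative, first_year)
mul_fifth = seasonal_swing(multiplicative, fifth_year)        # larger than mul_first
```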

12
Q

Amazon Athena

A

Uses Presto under the hood

Serverless interactive queries against data in S3

13
Q

Athena Supported formats

A
CSV
JSON
ORC
Parquet
Avro

The data can be unstructured, semi-structured, or structured

14
Q

Can you integrate Athena with notebooks?

A

Yes, you can:

  • Jupyter
  • Zeppelin
  • RStudio
15
Q

Does Athena charge for DDL?

A

No, DDL statements (CREATE/ALTER/DROP) are not charged

16
Q

How to save money on Athena?

A

Use a columnar format so that queries scan less data

ORC, Parquet

17
Q

Is cross-account access possible in S3?

A

Yes, it is. Grant the access in the S3 bucket policy

18
Q

Can you use CSE-KMS to encrypt Athena results at rest in S3?

A

Yes, you can

19
Q

Athena Anti-Patterns?

A

Highly formatted reports / visualizations
- use QuickSight instead

ETL
- use Glue ETL instead

20
Q

QuickSight?

A
Serverless

Data sources:
Redshift
Aurora / RDS
Athena
EC2-hosted databases
Files
- Excel
- CSV, TSV
- common log format

It can also do some limited ETL

21
Q

SPICE

A

Accelerates interactive queries

In-memory engine

22
Q

QuickSight ML capabilities?

A

Anomaly Detection
Forecasting
Auto-narratives

Anomaly detection uses Random Cut Forest

23
Q

How do you visualize hierarchical aggregations?

A

Tree Maps

24
Q

Different node types in EMR?

A

Master node
- manages the cluster

Core node
- stores data on HDFS

Task node

  • Runs tasks but doesn’t host data
  • Spot instances are a good fit for this type
  • The cluster can continue with or without it
25
Q

HDFS default block size

A

128 MB

26
Q

How does EMRFS provide a consistent view of S3?

A

It uses DynamoDB to track consistency

27
Q

Can you add/remove core/task nodes on the fly in EMR?

A

yes you can

28
Q

What are the Hadoop modules?

A

Top: MapReduce / Spark
Middle: YARN
Bottom: HDFS

Underlying all of this is Hadoop Core (Hadoop Common)

29
Q

Spark

  • How does it work in EMR?
  • How does it compare to MapReduce?
  • API
A

Spark runs on YARN, which negotiates cluster resources, and it can work with data in HDFS

A faster alternative to MapReduce

It uses a directed acyclic graph (DAG) to manage dependencies and to process and schedule work efficiently

APIs for Java, Python, Scala and R

30
Q

Spark components

A

Spark SQL
MLLib
GraphX
Spark Streaming

Spark Core
Resilient Distributed Dataset (RDD)

31
Q

What is taking the place of the lower-level Resilient Distributed Dataset (RDD)?

A

DataFrame in Python

Dataset in Scala

32
Q

MLLib capabilities

A

Classification
- Logistic Regression, Naive Bayes

Regression

Decision trees

Recommendation Engine (ALS)

Clustering (K-Means)

LDA (Topic modeling)

ML workflow utilities

SVD, PCA, statistics

33
Q

What does Zeppelin bring to table with Spark?

A

Makes Spark feel more like a data science tool

Run Spark code interactively (like in Spark Shell)

Execute SQL queries directly against SparkSQL

Visualize in charts and graphs

34
Q

Compare EMR Notebook with Zeppelin

A

EMR Notebooks are backed up to S3

more integration with AWS

Provision/terminate cluster within notebook

Hosted inside a VPC

Access only via AWS Console

No charge for EMR customers

35
Q

What does Kerberos provide?

A

Strong Authentication through secret key cryptography

36
Q

What is the curse of dimensionality?

A

Too many features lead to sparse data

37
Q

Dimensionality Reduction methods

A

Principal Component Analysis (PCA)

K-Means

38
Q

Methods to impute data

A

Mean
- replace missing values with the mean of the entire column

Median
- use the median instead when outliers are present

Most Frequent Value
- for categorical: e.g. use the

Copy
- Summary for Description

KNN: for numerical data, or with Hamming distance for categorical data

Deep Learning
- good for categorical data

Regression

  • Linear or non-linear regression
  • MICE (Multiple imputation by Chained Equations)

Get more data
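
The mean and median strategies above can be sketched in a few lines. The `impute` helper and the income figures are hypothetical; missing values are represented as `None`.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill missing entries (None) with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical incomes (in thousands) with one missing value and one outlier.
incomes = [30, 35, 40, None, 45, 1000]

mean_filled = impute(incomes, "mean")      # gap filled with 230, pulled up by the outlier
median_filled = impute(incomes, "median")  # gap filled with 40, robust to the outlier
```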

39
Q

Unbalanced data?

A

Oversampling

  • duplicate from minority class
  • can be done at random

Undersampling
- remove from majority class

SMOTE
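
Random oversampling and undersampling can be sketched as below. The helper names and the toy data are assumptions for illustration, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes have the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

majority = list(range(100, 110))  # 10 hypothetical majority-class samples
minority = [1, 2, 3]              # 3 hypothetical minority-class samples

over_major, over_minor = random_oversample(majority, minority)
under_major, under_minor = random_undersample(majority, minority)
```

Undersampling discards data, so it is usually the less attractive option unless the dataset is very large.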

40
Q

SMOTE

A

Synthetic Minority Over-sampling Technique

Artificially generate new samples of the minority class using nearest neighbors

  • KNN
  • Create new samples from KNN
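
The idea can be sketched as follows: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line between them. This is a simplified illustration with made-up data and a hypothetical `smote_sketch` helper, not a full implementation; libraries such as imbalanced-learn provide one.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the chosen sample.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # A random position on the segment between the sample and its neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]  # hypothetical 2-D samples
new_points = smote_sketch(minority, n_new=5)
```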
41
Q

Variance

A

Sigma squared (σ²):

average of the squared differences from the mean

42
Q

Standard Deviation

A

Sigma (σ):

Square root of variance
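
A small worked example of variance and standard deviation, computed by hand from the definitions. The data values are made up, and the population form (dividing by n) is assumed; the sample form divides by n - 1 instead.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = sum(data) / len(data)                               # 5.0
variance = sum((x - mean) ** 2 for x in data) / len(data)  # 32 / 8 = 4.0
std_dev = math.sqrt(variance)                              # 2.0

# The standard library gives the same population values.
matches_stdlib = (
    math.isclose(variance, statistics.pvariance(data))
    and math.isclose(std_dev, statistics.pstdev(data))
)
```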

43
Q

How to spot outliers?

A

Box-and-whisker plot, e.g. points more than 1.5 × the interquartile range beyond the quartiles

Standard deviation, e.g. points more than some multiple of σ from the mean

AWS's Random Cut Forest algorithm
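
The interquartile-range rule can be sketched with the standard library. The readings are hypothetical, and `statistics.quantiles` is used with its default method, so the exact quartile values may differ slightly from other tools.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # hypothetical measurements
outliers = iqr_outliers(readings)  # [95]
```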

44
Q

Binning

A

Bucket numerical values to make them categorical

Covers up imprecision, uncertainty, or errors in measurements

Quantile binning: each bin holds an equal number of samples
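
Quantile binning can be sketched by ranking the values and cutting the ranking into equal-sized groups. The ages and labels below are hypothetical, and this simple version may split tied values across bins.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index so that the bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [22, 25, 31, 38, 45, 52, 67, 80]
labels = ["young", "early-career", "mid-career", "senior"]  # made-up category names

# Each numerical age becomes a category, with two people in every bin.
binned = [labels[b] for b in quantile_bins(ages, 4)]
```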

45
Q

Transforming

A

Apply a function to a feature to make it better suited for training

Features with an exponential trend may benefit from a logarithmic transformation, e.g. ln(x)

Other options: x^2 or sqrt(x)
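
A quick check of why a logarithm helps with an exponential trend: after the transform, the feature grows by a constant step. The series below is synthetic and chosen only for illustration.

```python
import math

# Hypothetical feature with an exponential trend: y = 2 * e^(0.5 * t)
y = [2.0 * math.exp(0.5 * t) for t in range(6)]

# ln(y) = ln(2) + 0.5 * t, which is linear in t.
log_y = [math.log(v) for v in y]

# Consecutive differences are all (approximately) 0.5 after the transform.
steps = [b - a for a, b in zip(log_y, log_y[1:])]
```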

46
Q

Encoding

A

Common in deep learning

one-hot encoding

  • create buckets for every category
  • bucket has 1 for category and 0 for others
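
One-hot encoding written out by hand, as a minimal sketch. The `one_hot` helper and the color values are hypothetical; the columns are ordered alphabetically by category.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # 1 in the bucket for this category, 0 elsewhere
        encoded.append(row)
    return categories, encoded

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
# categories == ['blue', 'green', 'red']
# encoded == [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```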
47
Q

Scaling / Normalizing

A

Imagine features such as age and income, which have very different ranges
Normalize them so they are on a level playing field

Basically, this avoids giving more weight to features with larger magnitudes

Scikit-Learn has a preprocessing module for this (MinMaxScaler)

Remember to scale the results back
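
Min-max scaling and the "scale back" step can be sketched by hand as below. This re-implements the idea behind a min-max scaler rather than calling Scikit-Learn; the helper names and data are hypothetical, and it assumes the feature is not constant (max > min).

```python
def fit_min_max(values):
    """Learn the range of a feature from the training data."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def unscale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

ages = [18, 30, 45, 60]
incomes = [20_000, 55_000, 90_000, 160_000]

age_range = fit_min_max(ages)
income_range = fit_min_max(incomes)

scaled_ages = scale(ages, *age_range)           # now between 0 and 1
scaled_incomes = scale(incomes, *income_range)  # now between 0 and 1

restored_incomes = unscale(scaled_incomes, *income_range)
```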

48
Q

Shuffling

A

Avoids learning from residual signals in the training data that result from the order in which the samples were collected
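
When shuffling, features and labels must stay paired. A minimal sketch with made-up data, where the samples were collected sorted by label:

```python
import random

features = [10, 20, 30, 40, 50, 60]  # hypothetical samples
labels = [0, 0, 0, 1, 1, 1]          # collected in label order

rng = random.Random(42)  # fixed seed so the shuffle is reproducible

# Shuffle the (feature, label) pairs together so each label stays with its sample.
paired = list(zip(features, labels))
rng.shuffle(paired)
shuffled_features, shuffled_labels = (list(column) for column in zip(*paired))
```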

49
Q

SageMaker Ground Truth?

A

Manages humans who label data

Efficient because it develops its own model during the process, which reduces the reliance on human labelers

50
Q

Who are the labelers of SageMaker Ground Truth?

A

Mechanical Turk
Internal team
Professional labeling companies

51
Q

Amazon Rekognition

A

Classification and feature extraction (tags) for images

52
Q

Amazon Comprehend

A

Extracts sentiment or topics from text

53
Q

TF-IDF

A

Term Frequency and Inverse Document Frequency

we can use the log of IDF since word frequencies are distributed exponentially
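
A hand-rolled TF-IDF sketch on a tiny made-up corpus. It uses relative term frequency and the natural log of the inverse document frequency; real libraries often add smoothing, so their numbers will differ. The `idf` helper assumes the term occurs in at least one document.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]
corpus = [doc.split() for doc in docs]  # naive whitespace tokenization

def tf(term, tokens):
    """How often the term occurs in this document, relative to its length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / doc_freq)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "the" is in every document, so it carries no weight; "cat" is in only one.
score_the = tf_idf("the", corpus[0], corpus)  # 0.0
score_cat = tf_idf("cat", corpus[0], corpus)  # (1/6) * ln(3), greater than 0
```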

54
Q

Unigrams, bigrams and n-grams?

A
An extension of TF-IDF: score sequences of n consecutive words, not just single words
e.g.
I Love Certification
uni: I,Love,Certification
bigram: I Love, Love Certification
...
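
The example above can be reproduced with a short helper. The `ngrams` function is a hypothetical name, and it splits on whitespace only.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I Love Certification"
unigrams = ngrams(sentence, 1)  # ['I', 'Love', 'Certification']
bigrams = ngrams(sentence, 2)   # ['I Love', 'Love Certification']
trigrams = ngrams(sentence, 3)  # ['I Love Certification']
```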
55
Q

For TF-IDF, what transformation do we apply?

A

Tokenize the content, then convert it into a sparse vector

56
Q

MICE?

A

Multiple Imputation by Chained Equations (MICE) finds relationships between features and is one of the most advanced imputation methods available. Machine learning techniques such as KNN and deep learning are also good approaches.