Exploratory Data Analysis Flashcards

(56 cards)

1
Q

How are Pandas and NumPy used together?

A

Use Pandas to explore and prepare the data, then load the result into NumPy arrays to feed into an ML algorithm

2
Q

Data types

A

Numerical
Categorical
Ordinal

3
Q

Numerical Data types

A

Discrete (integer-valued):
- e.g. head count

Continuous:
infinite precision (any value within a range)
- e.g. how much rain fell on a given day

4
Q

Categorical Data

A

Gender
Political Parties

Order doesn't matter
No intrinsic numerical meaning

5
Q

Ordinal Data

A

Mixture of categorical and numerical data: categories that have a meaningful order

  • Movie Ratings (4 Star movie)
6
Q

Normal Distribution vs Probability Mass Function

A

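Continuous vs. discrete data

The normal distribution is a probability density function for continuous data; a probability mass function gives probabilities for discrete data.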

7
Q

Poisson Distribution

A

Works with discrete data: it models the number of events that occur in a fixed interval
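
To illustrate why the Poisson distribution is a discrete one, the sketch below evaluates its probability mass function, P(k) = λ^k · e^(−λ) / k!, for whole-number event counts. The function name `poisson_pmf` and the rate of 3 events per interval are assumptions made up for this example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events when the average rate is lam."""
    return (lam ** k) * math.exp(-lam) / math.factorial(k)

# Hypothetical rate: on average 3 events per interval.
rate = 3.0
probabilities = [poisson_pmf(k, rate) for k in range(50)]

for k in range(6):
    print(f"P(X = {k}) = {probabilities[k]:.4f}")
```

Each probability is attached to a single integer count, and the probabilities over all counts add up to 1.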

8
Q

Binomial Distribution

A

Discrete data with two possible outcomes per trial (boolean)

The binomial distribution models the number of successes in a fixed number of independent yes/no trials, such as coin flips.

9
Q

Bernoulli Distribution

A
Special case of binomial distribution
single trial (n=1)

A binomial random variable is the sum of n independent Bernoulli trials with the same success probability
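
The relationship between the two distributions can be sketched in a few lines. The function names and the coin-flip numbers below are assumptions chosen for illustration; the Bernoulli case is simply the binomial formula with n = 1.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

def bernoulli_pmf(k: int, p: float) -> float:
    """A Bernoulli trial is a binomial with a single trial (n = 1)."""
    return binomial_pmf(k, 1, p)

# Ten flips of a fair coin: chance of exactly 5 heads.
five_heads = binomial_pmf(5, 10, 0.5)
print(round(five_heads, 4))  # 0.2461
```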

10
Q

Additive model

A

Seasonality + Trend + Noise

Constant seasonal variation

11
Q

Multiplicative model

A

Seasonality × Trend × Noise

Seasonal variation increases as the trend increases
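
The difference between the additive and multiplicative models can be seen with a small synthetic series. This is a sketch under stated assumptions: the linear trend, the sine-shaped seasonality, and all function names are invented for the example, and the noise term is left out so the result is deterministic.

```python
import math

def trend(t):
    return 100 + 2 * t                      # hypothetical linear trend

def seasonality(t, period=12):
    return math.sin(2 * math.pi * t / period)

def additive(t):
    return trend(t) + 10 * seasonality(t)            # swing stays near +/-10

def multiplicative(t):
    return trend(t) * (1 + 0.1 * seasonality(t))     # swing is 10% of the trend

def seasonal_swing(model, start, period=12):
    """Peak-to-trough range of one period after subtracting the trend."""
    detrended = [model(t) - trend(t) for t in range(start, start + period)]
    return max(detrended) - min(detrended)

early_add, late_add = seasonal_swing(additive, 0), seasonal_swing(additive, 120)
early_mul, late_mul = seasonal_swing(multiplicative, 0), seasonal_swing(multiplicative, 120)
print(early_add, late_add)   # roughly equal
print(early_mul, late_mul)   # the later swing is larger
```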

12
Q

Amazon Athena

A

Presto under the hood

Serverless interactive queries against data in S3

13
Q

Athena Supported formats

A
CSV
JSON
ORC
Parquet
Avro

Data can be unstructured, semi-structured, or structured

14
Q

Can you integrate Athena with notebooks?

A

Yes, you can:

  • Jupyter
  • Zeppelin
  • RStudio
15
Q

Does Athena charge for DDL?

A

No, DDL statements (CREATE/ALTER/DROP) are not charged

16
Q

How to save money on Athena?

A

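Use a columnar format (ORC, Parquet)

Athena charges by the amount of data scanned, and columnar formats let a query read only the columns it needs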

17
Q

Is cross-account access in S3 possible?

A

Yes, it is: configure the bucket policy to grant it

18
Q

Can you use CSE-KMS to encrypt Athena results at rest in S3?

A

Yes, you can

19
Q

Athena Anti-Patterns?

A

Highly formatted reports / visualizations
- use QuickSight instead

ETL
- use Glue ETL instead

20
Q

QuickSight?

A
Serverless

Data sources:
- Redshift
- Aurora / RDS
- Athena
- EC2
- Files
  - Excel
  - CSV, TSV
  - common log format

It can also do some limited ETL

21
Q

SPICE

A

Super-fast, Parallel, In-memory Calculation Engine

QuickSight's in-memory engine, which accelerates interactive queries

22
Q

QuickSight ML capabilities?

A

Anomaly Detection
Forecasting
Auto-narratives

using RANDOM_CUT_FOREST

23
Q

How to draw hierarchical aggregations?

24
Q

Different node types in EMR?

A

Master node

Core node
- stores data on HDFS

Task node

  • Runs tasks and doesn't host data
  • A good fit for Spot Instances
  • The cluster can continue with or without it
25
HDFS default block size
128 MB
26
How does EMRFS provide a consistent view of S3?
By using DynamoDB to track object metadata
27
Can you add or remove core/task nodes on the fly in EMR?
Yes, you can
28
What are the Hadoop modules?
- Top: MapReduce / Spark
- Middle: YARN
- Bottom: HDFS
- Underlying all of this: Hadoop Core (Hadoop Common)
29
Spark: how does it work in EMR, how does it compare to MapReduce, and which APIs does it offer?
- Thanks to YARN, Spark can work with data in HDFS
- A faster alternative to MapReduce
- Uses a DAG (directed acyclic graph) to manage dependencies and schedule processing efficiently
- APIs for Java, Python, Scala and R
30
Spark components
- Spark SQL
- MLlib
- GraphX
- Spark Streaming
- Spark Core
- Resilient Distributed Dataset (RDD)
31
What is replacing the lower-level Resilient Distributed Dataset?
DataFrame in Python | Dataset in Scala
32
MLlib capabilities
- Classification: logistic regression, naive Bayes
- Regression
- Decision trees
- Recommendation engine (ALS)
- Clustering (K-Means)
- LDA (topic modeling)
- ML workflow utilities
- SVD, PCA, statistics
33
What does Zeppelin bring to the table with Spark?
- Makes Spark feel like a data science tool
- Run Spark code interactively (like in the Spark shell)
- Execute SQL queries directly against Spark SQL
- Visualize results in charts and graphs
34
Compare EMR Notebook with Zeppelin
EMR Notebook:
- Backed up to S3
- More integration with AWS
- Provision/terminate clusters from within the notebook
- Hosted inside a VPC
- Accessed only via the AWS Console
- No charge for EMR customers
35
What does Kerberos provide?
Strong authentication through secret-key cryptography
36
What is the curse of dimensionality?
Too many features lead to sparse data
37
Dimensionality Reduction methods
Principal Component Analysis (PCA) | K-Means
38
Methods to impute data
- Mean: replace missing values with the mean of the entire column
- Median: use the median instead when outliers are present
- Most frequent value: for categorical data
- Copy: e.g. use the Summary for a missing Description
- KNN: for numerical data, or with Hamming distance for categorical data
- Deep learning: good for categorical data
- Regression: linear or non-linear regression; MICE (Multiple Imputation by Chained Equations)
- Get more data
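
The three simplest strategies (mean, median, most frequent value) can be sketched with the standard library. The `impute` helper, the strategy names, and the sample columns are assumptions invented for this example; real projects would more likely use a library imputer.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries using a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "most_frequent":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

incomes = [30, 35, None, 40, 1000]      # hypothetical column with one outlier
colors = ["red", "blue", None, "red"]   # hypothetical categorical column

print(impute(incomes, "mean"))           # the outlier drags the fill value up
print(impute(incomes, "median"))         # the median is robust to the outlier
print(impute(colors, "most_frequent"))
```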
39
Unbalanced data?
- Oversampling: duplicate samples from the minority class (can be done at random)
- Undersampling: remove samples from the majority class
- SMOTE
40
SMOTE
Synthetic Minority Over-sampling Technique
- Artificially generates new samples of the minority class using nearest neighbors (KNN)
- Runs KNN on each minority sample and creates new samples from the result
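
The core idea can be sketched as interpolation between a minority sample and one of its nearest minority neighbors. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all assumptions.

```python
import math
import random

def smote_sketch(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority_class = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # made-up points
new_points = smote_sketch(minority_class)
print(new_points)
```

Unlike plain oversampling, the new points are not duplicates; each lies between two existing minority samples.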
41
Variance
Sigma squared (σ²): the average of the squared differences from the mean
42
Standard Deviation
Sigma (σ): the square root of the variance
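
Both definitions can be checked by hand on a small sample. The data values below are made up for illustration; the calculation follows the population form (dividing by the number of samples), whereas `statistics.variance` and `statistics.stdev` use the sample form that divides by n − 1.

```python
import math
import statistics

data = [1, 4, 5, 4, 8]

mean = sum(data) / len(data)
# Variance: average of the squared differences from the mean.
variance = sum((x - mean) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 2), round(std_dev, 2))
# The standard library's population versions should agree.
print(statistics.pvariance(data), statistics.pstdev(data))
```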
43
How to spot outliers?
- Box-and-whisker plot: e.g. points beyond 1.5 × the interquartile range
- Standard deviation: points more than some multiple of σ from the mean
- AWS proprietary algorithm: RANDOM_CUT_FOREST
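
The 1.5 × interquartile range rule used by box-and-whisker plots can be sketched as follows. The helper name `iqr_outliers` and the sample readings are assumptions for the example, and quartile values depend slightly on the quantile method used (here, the default of `statistics.quantiles`).

```python
import statistics

def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

readings = [10, 12, 11, 13, 12, 11, 10, 95]   # hypothetical measurements
print(iqr_outliers(readings))  # [95]
```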
44
Binning
- Bucket numerical values to turn them into categorical values
- Covers up imprecision, uncertainty, or errors in measurements
- Quantile binning: an equal number of samples in each bin
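
Quantile binning can be sketched by computing cut points from the data itself and then looking up each value's bin. The function name `quantile_bins` and the ages below are assumptions for illustration.

```python
import bisect
import statistics

def quantile_bins(values, n_bins=4):
    """Assign each value a bin index so bins hold roughly equal counts."""
    cut_points = statistics.quantiles(values, n=n_bins)
    return [bisect.bisect_right(cut_points, v) for v in values]

ages = [18, 22, 25, 31, 38, 45, 52, 67]   # hypothetical numerical feature
bins = quantile_bins(ages)
print(bins)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Because the cut points follow the data, every bin ends up with two of the eight samples, even though the age ranges covered by the bins differ in width.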
45
Transforming
- Apply a function to a feature to make it better suited for training
- Features with an exponential trend may benefit from a logarithmic transformation
- e.g. ln(x), x², or sqrt(x)
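
A short sketch shows what the logarithm does to an exponential trend. The feature here is invented (it doubles at every step); after taking ln(x), the steps between consecutive values become constant, so the trend is linear.

```python
import math

# Hypothetical feature that doubles at every step (exponential trend).
raw = [2.0 ** t for t in range(1, 9)]
logged = [math.log(x) for x in raw]

raw_steps = [b - a for a, b in zip(raw, raw[1:])]
log_steps = [b - a for a, b in zip(logged, logged[1:])]

print(raw_steps)                            # steps keep growing
print([round(s, 3) for s in log_steps])     # steps are all about ln(2)
```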
46
Encoding
One-hot encoding is common in deep learning
- create a bucket for every category
- the bucket for the sample's category holds 1 and all the others hold 0
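
One-hot encoding can be sketched in plain Python. The `one_hot` helper and the color values are assumptions for the example; sorting the categories is just one way to fix a stable column order.

```python
def one_hot(values):
    """Return the category order and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1          # 1 in the bucket for this value's category
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```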
47
Scaling / Normalizing
- Imagine an age range next to an income range: the magnitudes differ widely
- Normalize the features so they are on a level playing field, to avoid giving more weight to larger magnitudes
- scikit-learn has a preprocessing module for this (e.g. MinMaxScaler)
- Remember to scale the results back
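
Min-max scaling and the "scale back" step can be sketched without any library. The helper names and the age and income values are assumptions for illustration; scikit-learn's `MinMaxScaler` offers the same idea through `fit_transform` and `inverse_transform`.

```python
def fit_min_max(values):
    """Remember the range so results can be scaled back later."""
    return min(values), max(values)

def scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

def unscale(scaled, lo, hi):
    return [s * (hi - lo) + lo for s in scaled]

ages = [20, 35, 50, 65]                          # hypothetical features
incomes = [20_000, 50_000, 80_000, 200_000]

age_range = fit_min_max(ages)
income_range = fit_min_max(incomes)
scaled_ages = scale(ages, *age_range)            # both now lie in [0, 1]
scaled_incomes = scale(incomes, *income_range)
restored_incomes = unscale(scaled_incomes, *income_range)
print(scaled_ages, scaled_incomes)
```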
48
Shuffling
Avoids learning from residual signals in the training data that result from the order in which it was collected
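
When shuffling, each feature row has to stay attached to its label. A minimal sketch, with made-up data collected in label order and an arbitrary seed chosen only to make the run repeatable:

```python
import random

features = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
labels = [0, 0, 0, 1, 1, 1]   # collected in label order

# Pair each row with its label, shuffle the pairs, then split them again.
rows = list(zip(features, labels))
random.Random(42).shuffle(rows)
shuffled_features, shuffled_labels = map(list, zip(*rows))

print(shuffled_features)
print(shuffled_labels)
```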
49
SageMaker Ground Truth?
- Manages the humans who label data
- Efficient because it develops its own model during the process, which reduces the reliance on human labelers
50
Who are the labelers of SageMaker Ground Truth?
- Mechanical Turk
- Internal team
- Professional labeling companies
51
AWS Rekognition
Classification and feature extraction (tags) for images
52
AWS Comprehend
Extracts sentiment or topics from text
53
TF-IDF
- Term Frequency and Inverse Document Frequency
- We can use the log of the IDF, since word frequencies are distributed exponentially
54
Unigrams, bigrams and n-grams?
```
an extension of TF-IDF
e.g. I Love Certification
unigrams: I, Love, Certification
bigrams: I Love, Love Certification
...
```
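
Generating n-grams is a sliding window over the tokens. The sketch below uses the card's own example sentence; the function name `ngrams` and the whitespace tokenization are assumptions.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I Love Certification"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
print(unigrams)  # ['I', 'Love', 'Certification']
print(bigrams)   # ['I Love', 'Love Certification']
```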
55
For TF-IDF, what transformations do we apply?
Tokenize the content and then build the sparse vector
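
The tokenize-then-vectorize flow can be sketched as follows. This uses one common textbook form of the weight, TF × ln(N / document frequency); libraries often apply smoothed variants, so the exact numbers are an assumption of this example, as are the function names and the three tiny documents. Each document becomes a sparse vector, stored here as a dict that holds only non-zero weights.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tf_idf(documents):
    """Return one sparse vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
            if doc_freq[term] < n_docs   # weight would be 0, so leave it out
        })
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tf_idf(docs)
print(vectors[1])
```

In the second document, "dog" gets a higher weight than "sat" because it appears in fewer documents, and "the" drops out entirely because it appears in all of them.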
56
MICE?
Multiple Imputation by Chained Equations finds relationships between features and is one of the most advanced imputation methods available. Machine learning techniques such as KNN and deep learning are also good approaches.