Exploratory Data Analysis Flashcards

Question

HDFS default size

Answer 1

using DynamoDB

Answer 2

yes you can

Answer 3

Top: MapReduce and Spark. Middle: YARN. Bottom: HDFS. Underlying all of this is Hadoop Core (Hadoop Common).

Answer 4

Thanks to YARN, Spark can negotiate with HDFS. It is a faster alternative to MapReduce. It uses a DAG (directed acyclic graph) to manage dependencies and to schedule processing efficiently. It offers APIs for Java, Python, Scala, and R.

Answer 5

Spark SQL, MLlib, GraphX, Spark Streaming, Spark Core, Resilient Distributed Dataset (RDD)

Answer 6

DataFrame in Python | Dataset in Scala

Answer 7

Classification: logistic regression, naive Bayes. Regression. Decision trees. Recommendation engine (ALS). Clustering (K-Means). LDA (topic modeling). ML workflow utilities. SVD, PCA, statistics.

Answer 8

Makes Spark feel like a data science tool. Run Spark code interactively (as in the Spark shell). Execute SQL queries directly against Spark SQL. Visualize results in charts and graphs.

Answer 9

EMR Notebooks are backed up to S3 and have more integration with AWS. Provision or terminate clusters from within the notebook. Hosted inside a VPC. Accessed only via the AWS Console. No charge for EMR customers.

Answer 10

Strong Authentication through secret key cryptography

Answer 11

Too many features lead to sparse data

Answer 12

Principal Component Analysis (PCA) | K-Means

Answer 13

Mean: replace with the mean of the entire column. Median: use when outliers are present. Most frequent value: for categorical data. Copy: e.g. use the Summary for the Description. KNN: for numerical data, or with Hamming distance for categorical data. Deep learning: good for categorical data. Regression: linear or non-linear; MICE (Multiple Imputation by Chained Equations). Get more data.
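
The three simplest strategies above (mean, median, most frequent value) can be sketched with the Python standard library. The `impute` helper and the sample columns are illustrative assumptions, not part of the original card; missing values are represented as `None`.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

incomes = [30, 35, None, 40, 1000]   # 1000 is an outlier
colors = ["red", "blue", None, "red"]

mean_filled = impute(incomes, "mean")          # the outlier drags the fill value up
median_filled = impute(incomes, "median")      # the median is robust to the outlier
mode_filled = impute(colors, "most_frequent")  # works for categorical data
```

Here the mean fill is 276.25 while the median fill is 37.5, which shows why the median is preferred when outliers are present.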

Answer 14

Oversampling: duplicate samples from the minority class (can be done at random). Undersampling: remove samples from the majority class. SMOTE.

Answer 15

Synthetic Minority Over-sampling Technique. Artificially generates new samples of the minority class using nearest neighbors (KNN): new samples are created from each sample's nearest neighbors.
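
The idea can be illustrated with a rough sketch: pick a minority sample, pick one of its nearest minority neighbors, and create a new point somewhere on the line between them. This is a simplified illustration, not a production SMOTE implementation; the function name `smote_sketch`, the parameters, and the sample points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen point (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority_class = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_samples = smote_sketch(minority_class, n_new=3)
```

Each synthetic point lies between two real minority samples, so the new data stays inside the region the minority class already occupies.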

Answer 16

Sigma squared (σ²): | the average of the squared differences from the mean

Answer 17

Sigma (σ): | the square root of the variance
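
A small worked example of both definitions, checked against the standard library's population variance and standard deviation. The data values are made up for illustration.

```python
from statistics import pvariance, pstdev

data = [1, 4, 5, 4, 8]

mu = sum(data) / len(data)                               # mean
variance = sum((x - mu) ** 2 for x in data) / len(data)  # sigma squared
std_dev = variance ** 0.5                                # sigma

# The hand-rolled values agree with the library's population statistics.
library_variance = pvariance(data)
library_std_dev = pstdev(data)
```

For this data the mean is 4.4, the variance is about 5.04, and the standard deviation is about 2.24.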

Answer 18

Box-and-whisker plot: e.g. values beyond 1.5 times the interquartile range. Standard deviation. AWS proprietary algorithm: Random Cut Forest (RANDOM_CUT_FOREST).
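
The interquartile-range rule can be sketched as follows. The helper name `iqr_outliers` and the readings are illustrative assumptions, and the quartiles use the default method of `statistics.quantiles`; other quartile conventions can give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

readings = [10, 12, 11, 13, 12, 11, 95]
outliers = iqr_outliers(readings)
```

With these readings only the value 95 falls outside the fences.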

Answer 19

Bucket numerical values to make them categorical. Covers up imprecision, uncertainty, or errors in measurements. Quantile binning: even sizes in each bin.
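
A minimal sketch of quantile binning, where each bin receives the same number of values. The helper `quantile_bins` and the ages are hypothetical; ties are broken by position here, which a real implementation may handle differently.

```python
def quantile_bins(values, n_bins):
    """Assign each value a bin index so bins hold (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [22, 67, 35, 41, 18, 53, 29, 74]
age_bins = quantile_bins(ages, n_bins=4)
```

The bin edges adapt to the data, so each of the four bins ends up with exactly two ages, unlike fixed-width bins, which could leave some bins empty.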

Answer 20

Apply a function to a feature for better training. Features with an exponential trend may benefit from a logarithmic transformation. E.g. ln(x), x^2, or sqrt(x).
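
As a small illustration, a feature that grows by a constant factor becomes evenly spaced after a natural-log transformation. The `views` values are made up.

```python
import math

views = [1, 10, 100, 1000, 10000]        # grows by a factor of 10 each step
log_views = [math.log(v) for v in views]  # natural log, ln(x)

# After the transform, consecutive values differ by a constant amount, ln(10).
steps = [b - a for a, b in zip(log_views, log_views[1:])]
```

A linear model can fit the transformed feature far more easily than the raw exponential trend.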

Answer 21

Common in deep learning. One-hot encoding: create a bucket for every category; the bucket has 1 for the sample's category and 0 for all others.
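
A minimal sketch of one-hot encoding in plain Python. The helper `one_hot` and the labels are illustrative; categories are sorted alphabetically here only to make the column order predictable.

```python
def one_hot(labels):
    """Return the category list and a 0/1 row per label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

categories, encoded = one_hot(["cat", "dog", "bird", "dog"])
```

Each row contains exactly one 1, in the column of that sample's category.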

Answer 22

Imagine an age range and an income range: normalize them so they are on a level playing field, to avoid giving more weight to features with larger magnitudes. scikit-learn has a preprocessing module (MinMaxScaler). Remember to scale the results back.
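
Min-max scaling and its inverse can be sketched without any library. The helper names and sample values below are assumptions for illustration; in scikit-learn, `MinMaxScaler` provides the same idea through `fit_transform` and `inverse_transform`.

```python
def fit_min_max(values):
    """Learn the range of a feature."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Map values from [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def unscale(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]

ages = [20, 35, 50, 65]
incomes = [20000, 50000, 80000, 200000]

age_range = fit_min_max(ages)
income_range = fit_min_max(incomes)
scaled_ages = scale(ages, *age_range)           # both features now span 0 to 1
scaled_incomes = scale(incomes, *income_range)
restored_incomes = unscale(scaled_incomes, *income_range)
```

Keeping the learned range around is what makes it possible to scale predictions back to the original units.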

Answer 23

Avoids learning from residual signals in the training data that result from the order in which the samples were collected.
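
One detail worth showing: features and labels must be shuffled together so each sample keeps its label. A minimal sketch with made-up data and a fixed seed for repeatability:

```python
import random

features = [[1], [2], [3], [4], [5]]
labels = ["a", "b", "c", "d", "e"]

# Pair each sample with its label, shuffle the pairs, then split them again.
rows = list(zip(features, labels))
random.Random(42).shuffle(rows)
shuffled_features, shuffled_labels = zip(*rows)
```

Shuffling the two lists independently would silently mismatch samples and labels.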

Answer 24

Manages humans who label data. Efficient because it starts to develop its own model during the process, reducing the reliance on humans.

Answer 25

Mechanical Turk, internal team, professional labeling companies

Answer 26

Classify and feature extraction (tags) of images

Answer 27

generate sentiment or topics from texts

Answer 28

Term Frequency and Inverse Document Frequency. We can use the log of the IDF, since word frequencies are distributed exponentially.
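
One common formulation, sketched under assumptions: TF is the term's share of the document's tokens, and IDF is the log of (number of documents / number of documents containing the term). Libraries use several smoothed variants, so treat this as an illustration only. The helper `tf_idf` and the tiny corpus are hypothetical, and the term is assumed to appear in at least one document.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF of a term in one tokenized document, relative to a corpus."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)  # assumes docs_with_term > 0
    return tf * idf

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred", "loudly"],
]

the_score = tf_idf("the", docs[0], docs)  # appears everywhere, so IDF is 0
cat_score = tf_idf("cat", docs[0], docs)  # appears in 2 of 3 documents
dog_score = tf_idf("dog", docs[1], docs)  # appears in 1 of 3 documents
```

A word that occurs in every document scores zero, while rarer words score higher, which is what makes TF-IDF useful for ranking relevance.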

Answer 29

```
An extension of TF-IDF, e.g. "I Love Certification"
unigrams: I, Love, Certification
bigrams: I Love, Love Certification
...
```
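
The example above can be reproduced with a short helper. The function name `ngrams` is an illustrative choice, and tokenization here is a plain whitespace split.

```python
def ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams("I Love Certification", 1)
bigrams = ngrams("I Love Certification", 2)
```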

Answer 30

Tokenize the content and then get the | Sparse Vector

Answer 31

Multiple Imputation by Chained Equations finds relationships between features and is one of the most advanced imputation methods available. Using machine learning techniques such as KNN and deep learning are also good approaches.

(56 cards)