Exploratory Data Analysis Flashcards

Correct syntax for numpy, pandas, matplotlib and seaborn

1
Q

Select columns (.loc) from DataFrame with ALL non-zeros

A

df.loc[ : , df.all( ) ]

2
Q

Select columns (.loc) from DataFrame with ANY non-zeros

A

df.loc[ : , df.any( ) ]

3
Q

Select columns (.loc) from DataFrame with ANY NaNs

A

df.loc [ : , df.isnull( ).any( ) ]

4
Q

Select columns (.loc) from DataFrame with NO NaNs

A

df.loc [ : , df.notnull( ).all( ) ]
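
The four column-selection idioms above can be compared on one small frame. This is a minimal sketch; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' has no zeros, 'b' has one zero,
# 'c' is all zeros, 'd' contains a NaN.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 5, 6],
    "c": [0, 0, 0],
    "d": [1.0, np.nan, 3.0],
})

# Each boolean Series (one entry per column) is used as a column mask in .loc
all_nonzero = df.loc[:, df.all()].columns.tolist()
any_nonzero = df.loc[:, df.any()].columns.tolist()
any_nan = df.loc[:, df.isnull().any()].columns.tolist()
no_nan = df.loc[:, df.notnull().all()].columns.tolist()

print(all_nonzero, any_nonzero, any_nan, no_nan)
```

Note that NaN is not zero, so column 'd' still counts as "all non-zero" here.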

5
Q

Drop rows with ANY NaNs

A

df.dropna(how='any')

6
Q

Import local file.xlsx using pandas (as data)

A
data = pd.ExcelFile('file.xlsx')
# print(data.sheet_names)
# df = data.parse('sheetname')  # or data.parse(0) for the first sheet
7
Q

Initial GET requests using urllib.

6 lines : import statements, url, request, response, read and close.

A
  1. from urllib.request import urlopen, Request
  2. url = "https://www.wikipedia.org"
  3. request = Request(url)
  4. response = urlopen(request)
  5. html = response.read( )
  6. response.close( )
8
Q

Initial GET requests using requests.

4 lines : import, url, request, read.

A
  1. import requests
  2. url = "https://www.wikipedia.org"
  3. r = requests.get(url)
  4. text = r.text
9
Q

Tidy Data: Principles. ( 3 )

A
  1. Columns represent separate variables containing values
  2. Rows represent individual observations
  3. Observational units form tables
10
Q

Tidy Data: Melting and Pivoting

A

Melting: turn columns into rows (report-friendly wide form into analysis-friendly long form).
Pivoting: turn unique values into separate columns (analysis-friendly long form into report-friendly wide form).

11
Q

Tidy Data: Melting syntax

A

pd.melt(frame=df, id_vars='col-2b-fixed', value_vars=[' ', ' '], var_name='name', value_name='name')

12
Q

Tidy Data: Pivoting syntax

A

df.pivot_table(values=' ', index=' ', columns=' ', aggfunc=np.mean)
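
A small round trip may make the melt and pivot syntax easier to remember. This is only a sketch; the frame and the names 'name', 'treatment' and 'result' are invented for the example.

```python
import pandas as pd

# Hypothetical wide (report-style) table: one column per treatment
wide = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "treatment_a": [1.0, 2.0],
    "treatment_b": [3.0, 4.0],
})

# Melt: the treatment columns become rows (one observation per row)
long = pd.melt(
    frame=wide,
    id_vars="name",
    value_vars=["treatment_a", "treatment_b"],
    var_name="treatment",
    value_name="result",
)

# Pivot: unique values of 'treatment' become columns again
back = long.pivot_table(values="result", index="name", columns="treatment", aggfunc="mean")

print(long)
print(back)
```

pivot_table aggregates duplicate index/column pairs with aggfunc, which is why it is often preferred over plain pivot when the long table may contain repeats.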

13
Q

Change column-type from ‘object’ to ‘numeric’

A

df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')
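
As a quick illustration of what errors='coerce' does, the sketch below converts a column that contains one unparseable string. The sample values are made up.

```python
import pandas as pd

# Hypothetical object column with one value that cannot be parsed as a number
df = pd.DataFrame({"object_col": ["1", "2.5", "missing"]})

# Unparseable entries become NaN instead of raising an error
df["object_col"] = pd.to_numeric(df["object_col"], errors="coerce")

print(df["object_col"].dtype)
print(df["object_col"].isna().sum())
```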

14
Q

Change column-type to ‘category’

A

df['column'] = df['column'].astype('category')
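
A minimal sketch of the category conversion, assuming a made-up column of string labels. The result has to be assigned back, since astype returns a new Series.

```python
import pandas as pd

# Hypothetical column of repeated string labels
df = pd.DataFrame({"column": ["low", "high", "low", "mid"]})
df["column"] = df["column"].astype("category")

# Categories are the sorted unique labels; codes are their integer positions
categories = df["column"].cat.categories.tolist()
codes = df["column"].cat.codes.tolist()

print(categories, codes)
```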

15
Q

Plot idioms for DataFrames (3)

A

df.plot(kind='hist')
plt.hist(df['column'])
df.hist()

16
Q

Syntax for .loc accessor

A

df.loc['Row_Label', 'Col_Label']
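
A short sketch of label-based access with a single .loc call, passing the row and column labels together. The labels below are placeholders. Using one call rather than two chained brackets also makes assignment reliable.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {"Col_A": [1, 2], "Col_B": [3, 4]},
    index=["Row_1", "Row_2"],
)

# Read one cell: row label first, then column label
value = df.loc["Row_2", "Col_B"]

# Write to the same cell with the same syntax
df.loc["Row_2", "Col_B"] = 40

print(value, df.loc["Row_2", "Col_B"])
```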

17
Q

Syntax: sns barplot - use.

A

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)

18
Q

Syntax: sns countplot - use.

A

sns.countplot(x='column', data=df, hue='category')

19
Q

Syntax: sns histogram plot - use.

A

sns.histplot(df['continuous'], kde=False, bins=30)  # sns.distplot before seaborn 0.11 (now deprecated)

20
Q

Syntax: sns scatterplot - use.

A

sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)

21
Q

Syntax: sns pairplot - use.

A

sns.pairplot(df, hue='categorical', palette='coolwarm')

22
Q

sns categorical plots

A

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.catplot(x='categorical', y='numerical', data=df, kind='bar')  # kind can also be 'point' or 'violin'; called factorplot before seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)

23
Q

Syntax: sns heatmap - use.

A

sns.heatmap(PivotTable, cmap=' ', linewidths=1, linecolor=' ')

25
Q

sns categorical plots (6)

A

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.catplot(x='categorical', y='numerical', data=df, kind='bar')  # kind can also be 'point' or 'violin'; called factorplot before seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)

26
Q

sns distribution plots (4)

A

sns.histplot(df['continuous'], kde=False, bins=30)  # sns.distplot before seaborn 0.11 (now deprecated)
sns.jointplot(x='continuous', y='numerical', data=df)
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns.rugplot(df['continuous'])

27
Q

sns matrix plots (2)

A

sns.heatmap(PivotTable, cmap=' ', linewidths=1, linecolor=' ')
sns.heatmap(df.corr(), annot=True)

sns.clustermap(PivotTable, cmap=' ')

28
Q

sns PairGrid

A

g = sns.PairGrid(df)

g.map(plt.scatter)
g.map_diag(sns.histplot)  # sns.distplot before seaborn 0.11 (now deprecated)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

29
Q

sns FacetGrid

A

g = sns.FacetGrid(data=df, col='category', row='category')
g.map(sns.histplot, 'numerical')  # sns.distplot before seaborn 0.11 (now deprecated)

30
Q

sns lmplot (regression)

A

sns.lmplot(x='numerical', y='numerical', data=df, hue='category', markers=['o', 'v'], scatter_kws={'s': 100})

31
Q

Scatter plots

A

df.plot.scatter(x='col_A', y='col_B', c='col_C', s=df['col_C']*100)
plt.scatter(x, y)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
df.iplot(kind='scatter', x='col_A', y='col_B', mode='markers')  # requires cufflinks

32
Q

Data Quality Control

  • find missing data
  • drop columns
  • drop rows
A

df.isnull()  # returns bool - NaN = True

sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
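
The card lists three steps (find missing data, drop columns, drop rows). A minimal sketch of all three on a made-up frame follows. The column names and the choice to drop 'cabin' are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'cabin' is mostly missing, 'age' has one gap
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
})

# 1. Find missing data: boolean mask, True where a value is NaN
missing_mask = df.isnull()
missing_per_column = missing_mask.sum().to_dict()

# 2. Drop a column that is mostly empty
df_no_cabin = df.drop(columns=["cabin"])

# 3. Drop the remaining rows that still contain a NaN
df_clean = df_no_cabin.dropna()

print(missing_per_column)
print(df_clean)
```

Dropping the sparse column first keeps more rows than calling dropna on the original frame would.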

33
Q

Histogram plots

A

df['continuous'].hist(bins=20)

df[df['column1'] == 1]['column2'].hist(bins=30, color='blue', label='label', alpha=0.5)

34
Q

Reproducible Data Analysis 1/10
Get data from web
Jake Vanderplas

A
from urllib.request import urlretrieve
URL = 'https://data.seattle.gov/api....'
urlretrieve(URL, 'file.csv')
data = pd.read_csv("file.csv", index_col='Date', parse_dates=True)
data.resample('W').sum( ).plot( )
35
Q

Reproducible Data Analysis 2/10
Exploratory Data Analysis
Jake Vanderplas

A

data.columns = ['West', 'East']
data['Total'] = data['West'] + data['East']
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
data.groupby(data.index.time).mean().plot()
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)  # line for every day
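
To see what the resample, groupby and pivot_table steps produce without downloading anything, the sketch below runs them on synthetic hourly counts. The numbers are invented and only the shapes matter. Plotting is left out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly counter data: 3 days x 24 hours
index = pd.date_range("2021-01-01", periods=72, freq=pd.Timedelta(hours=1))
data = pd.DataFrame(
    {"West": np.arange(72), "East": np.arange(72) * 2},
    index=index,
)
data["Total"] = data["West"] + data["East"]

# Daily totals: one row per calendar day
daily = data.resample("D").sum()

# Average by time of day: one row per hour
by_time = data.groupby(data.index.time).mean()

# One column per date, one row per time of day
pivoted = data.pivot_table("Total", index=data.index.time, columns=data.index.date)

print(daily.shape, by_time.shape, pivoted.shape)
```

Each column of pivoted is one day, which is why plotting it draws one line per day.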

36
Q

Reproducible Data Analysis 3/10
Version control with Git & GitHub
Jake Vanderplas

A
https://github.com
Create new repository (Name, description, README, .gitignore(Python), MIT License)
Copy Clone or download link
Terminal window: git clone 
mv JupyterNotebook.ipynb into git folder
git status
git add JupyterNotebook.ipynb
git commit -m "Add initial analysis"
git push origin master

open JupyterNotebook.ipynb from correct location
git status (file.csv now shows up as untracked)
vim .gitignore (add a "# data" comment line followed by "file.csv")

37
Q

Reproducible Data Analysis 4/10
Working with Data and GitHub
Jake Vanderplas

A

import os
from urllib.request import urlretrieve

import pandas as pd

URL = 'https://data.seattle.gov/api….'

def get_file(filename='file.csv', url=URL, force_download=False):
    # Download only when forced or when no cached copy exists
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']
    data['Total'] = data['West'] + data['East']
    return data

data = get_file()

38
Q

Reproducible Data Analysis 5/10
Creating a Python package
Jake Vanderplas

A

Terminal window
mkdir jupyterworkflow
touch jupyterworkflow/__init__.py
vim jupyterworkflow/data.py

""" Download and cache the data
Parameters: 
filename : string (optional)
    location to save the data
url : string (optional)
    web location
force_download : bool (optional)
Returns
"""
< replace 4/10 with following >
from jupyterworkflow.data import get_file
39
Q

Confusion matrix

A

True pos (tp). False pos (fp).

False neg (fn ). True neg (tn).

40
Q

Confusion matrix

Accuracy =

A

Fraction of correct predictions

Accuracy = correct / total
= (tp + tn) / (tp + fp + tn + fn)

41
Q

Confusion matrix

Precision =

A

How accurate positive predictions were.

tp / (tp + fp)

42
Q

Confusion matrix

Recall

A

What fraction of positives the model identified.

tp / (tp + fn)

43
Q

Confusion matrix

F1 score

A

The harmonic mean of precision and recall; it lies between them.

2 * prec * recall / (prec + recall)
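
The four confusion-matrix metrics can be checked with a few lines of plain Python. This is a sketch with hypothetical counts. The function name is an illustration, not a library API, and it does not guard against division by zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 predictions
metrics = classification_metrics(tp=70, fp=30, fn=10, tn=890)
print(metrics)
```

With these counts accuracy is 0.96 even though precision is only 0.70, which is why accuracy alone can mislead on imbalanced data.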

44
Q

Model trade-off between precision and recall.

A

Too many “yes” predictions give high fp: high recall, low precision.

Too few “yes” predictions give high fn: low recall, high precision.
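
The trade-off can be seen by moving a decision threshold over model scores. The scores and labels below are invented, and the helper function is only a sketch.

```python
# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0]


def precision_recall(threshold):
    """Predict "yes" when score >= threshold, then return (precision, recall)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    return tp / (tp + fp), tp / (tp + fn)


lenient = precision_recall(0.3)  # many "yes" predictions
strict = precision_recall(0.9)   # few "yes" predictions
print(lenient, strict)
```

The lenient threshold finds every positive but admits a false positive; the strict one makes no false positives but misses two of the three positives.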

45
Q

Input feature categories Naive Bayes classifier

A

Suited to yes or no features

46
Q

Input feature categories Regression models

A

Numerical features

47
Q

Input feature categories Decision tree

A

Numeric or categorical features

48
Q

Input feature categories SVM

A

Numerical features

49
Q

Common way to analyze the relationship between a categorical feature and a continuous feature

A

Boxplot

50
Q

Check for null values in the dataset.

A

print( df.isnull ( ).values.sum( ) )

51
Q

Check column-wise distribution of null values

A

print( df.isnull ( ).sum( ) )
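
The difference between the total and the column-wise null count is easy to see on a tiny example. The frame below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values in two columns
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, "x"],
    "c": [1, 2, 3],
})

# .values flattens to a NumPy array, so sum() counts every NaN in the frame
total_nulls = int(df.isnull().values.sum())

# Without .values, sum() works column by column
nulls_per_column = df.isnull().sum().to_dict()

print(total_nulls)
print(nulls_per_column)
```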

52
Q

Frequency distribution of categories within a feature

A

print(df['category_col'].value_counts())
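
A minimal sketch of value_counts on an invented column, including the normalize=True option for relative frequencies.

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"category_col": ["AA", "DL", "AA", "UA", "AA", "DL"]})

# Absolute counts, sorted from most to least frequent
counts = df["category_col"].value_counts()

# Relative frequencies (fractions that sum to 1)
shares = df["category_col"].value_counts(normalize=True)

print(counts)
print(shares)
```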

53
Q

Dictionary comprehension to map category strings to numeric values.

eg.
{'carrier': {'AA': 1, 'OO': 7, 'DL': 4, 'F9': 5, 'B6': 3, 'US': 9, 'AS': 2, 'WN': 11, 'VX': 10, 'HA': 6, 'UA': 8}}

A
labels = df['category_col'].astype('category').cat.categories.tolist( )
replace = {'category_col' : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}

print(replace)
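
An equivalent comprehension can be written with enumerate, and the resulting mapping applied with Series.map. This is a sketch on a made-up 'carrier' column; writing the codes to a new 'carrier_code' column is an assumption, not part of the card.

```python
import pandas as pd

# Hypothetical column of carrier codes
df = pd.DataFrame({"carrier": ["DL", "AA", "UA", "AA"]})

# Sorted unique labels, as in the card
labels = df["carrier"].astype("category").cat.categories.tolist()

# Same mapping as zip(labels, range(1, len(labels) + 1)), written with enumerate
mapping = {label: code for code, label in enumerate(labels, start=1)}

# Apply the mapping to get a numeric column
df["carrier_code"] = df["carrier"].map(mapping)

print(mapping)
print(df)
```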

54
Q

Categorical Data - 3 types

A

Nominal: No intrinsic order
Ordinal: Ordered or ranked.
Dichotomous: Nominal with only 2 categories
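
In pandas the three types can be sketched with pd.Categorical, where only the ordinal one sets ordered=True. The example values are invented.

```python
import pandas as pd

# Nominal: labels with no intrinsic order
nominal = pd.Categorical(["red", "blue", "red"])

# Ordinal: an explicit category order, so comparisons and min/max are defined
ordinal = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Dichotomous: nominal with exactly two categories
dichotomous = pd.Categorical(["yes", "no", "yes"])

smallest = ordinal.min()
is_binary = len(dichotomous.categories) == 2

print(nominal.ordered, ordinal.ordered, smallest, is_binary)
```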