Exploratory Data Analysis Flashcards

Question

sns categorical plots (6)

Answer 1

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.factorplot(x='categorical', y='numerical', data=df, kind='bar')  # or kind='point', kind='violin'; renamed sns.catplot in seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)

Answer 2

sns.distplot(df['continuous'], kde=False, bins=30)  # deprecated since seaborn 0.11 in favor of sns.histplot / sns.displot
sns.jointplot(x='continuous', y='numerical', data=df)
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns.rugplot(df['continuous'])

Answer 3

sns.heatmap(PivotTable, cmap=' ', linewidths=' ', linecolor=' ')
sns.heatmap(df.corr(), annot=True)
sns.clustermap(PivotTable, cmap=' ')

Answer 4

g = sns.PairGrid(df)
g.map(plt.scatter)
g.map_diag(sns.distplot)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

Answer 5

g = sns.FacetGrid(data=df, col='category', row='category')
g.map(sns.distplot, 'numerical')

Answer 6

sns.lmplot(x='numerical', y='numerical', data=df, hue='category', markers=['o', 'v'], scatter_kws={'s': 100})

Answer 7

df.plot.scatter(x='col_A', y='col_B', c='col_C', s=df['col_C'] * 100)
plt.scatter(x, y)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
df.iplot(kind='scatter', x='col_A', y='col_B', mode='markers')

Answer 8

df.isnull()  # returns a boolean DataFrame: True where the value is NaN
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Answer 9

df['continuous'].hist(bins=20)
df[df['column1'] == 1]['column2'].hist(bins=30, color='blue', label='label', alpha=0.5)

Answer 10

```
from urllib.request import urlretrieve
import pandas as pd

URL = 'https://data.seattle.gov/api....'
urlretrieve(URL, 'file.csv')  # download once to a local CSV
data = pd.read_csv('file.csv', index_col='Date', parse_dates=True)
data.resample('W').sum().plot()  # weekly totals
```

Answer 11

data.columns = ['West', 'East']
data['Total'] = data['West'] + data['East']
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
data.groupby(data.index.time).mean().plot()
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)  # one line for every day
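
A minimal sketch of the same resample, rolling, and pivot steps on synthetic data, so the shapes of the intermediate results can be inspected without the real download. The hourly values, the 14-day span, and the 7-day window are assumptions made for illustration; the card itself uses a 365-day window on the real counts.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts standing in for the downloaded file (values are made up).
index = pd.date_range("2021-01-01", periods=14 * 24, freq="60min")
data = pd.DataFrame({"West": np.arange(len(index)) % 24, "East": 1}, index=index)
data["Total"] = data["West"] + data["East"]

# Daily totals, then a 7-day rolling sum (the first 6 rows are NaN).
daily = data.resample("D").sum()
rolling = daily.rolling(7).sum()

# Average count for each time of day across all days.
by_time = data.groupby(data.index.time).mean()

# One column per date, one row per time of day.
pivoted = data.pivot_table("Total", index=data.index.time, columns=data.index.date)

print(daily.shape, by_time.shape, pivoted.shape)
```

Each of these results is a DataFrame, so `.plot()` can be chained onto it as on the card.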

Answer 12

```
https://github.com
Create new repository (Name, description, README, .gitignore (Python), MIT License)
Copy the "Clone or download" link
Terminal window:
git clone <copied link>
mv JupyterNotebook.ipynb into git folder
git status
git add JupyterNotebook.ipynb
git commit -m "Add initial analysis"
git push origin master
```
Open JupyterNotebook.ipynb from the correct location
git status (shows file.csv as untracked)
vim .gitignore, then add:
# data
file.csv

Answer 13

import os
from urllib.request import urlretrieve
import pandas as pd
URL = 'https://data.seattle.gov/api....'
def get_file(filename='file.csv', url=URL, force_download=False):
    # Download only when the file is missing or a refresh is forced
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']
    data['Total'] = data['West'] + data['East']
    return data
data = get_file()
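
The download-and-cache pattern on this card can be tried without network access. The sketch below is an illustration under assumptions: `get_cached` and `fake_fetch` are hypothetical names, and `fake_fetch` writes a tiny local CSV in place of `urlretrieve`.

```python
import os
import tempfile

def get_cached(filename, fetch, force_download=False):
    """Call fetch(filename) only when the file is missing or a refresh is forced."""
    if force_download or not os.path.exists(filename):
        fetch(filename)
        return True   # a download happened
    return False      # the cached copy was reused

calls = []

def fake_fetch(path):
    # Hypothetical stand-in for urlretrieve(url, path): writes a tiny CSV locally.
    calls.append(path)
    with open(path, "w") as f:
        f.write("Date,West,East\n2020-01-01,1,2\n")

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "file.csv")
    first = get_cached(target, fake_fetch)    # file missing, so it is fetched
    second = get_cached(target, fake_fetch)   # file present, so it is skipped
    forced = get_cached(target, fake_fetch, force_download=True)

print(first, second, forced, len(calls))
```

Only the first and the forced call reach `fake_fetch`, which is the behaviour the `force_download` flag is meant to give.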

Answer 14

Terminal window:
mkdir jupyterworkflow
touch jupyterworkflow/__init__.py
vim jupyterworkflow/data.py
```
"""
Download and cache the data

Parameters:
    filename : string (optional)
        location to save the data
    url : string (optional)
        web location
    force_download : bool (optional)

Returns
"""
< replace 4/10 with following >
from jupyterworkflow.data import get_file
```

Answer 15

True pos (tp). False pos (fp). False neg (fn). True neg (tn).
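
A minimal sketch of how the four counts are tallied, assuming binary labels encoded as 1 (positive) and 0 (negative). The function name `confusion_counts` and the sample lists are illustrative and not part of the cards.

```python
def confusion_counts(y_true, y_pred):
    """Tally tp, fp, fn, tn for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # predicted yes, actually yes
        elif predicted == 1 and actual == 0:
            fp += 1   # predicted yes, actually no
        elif predicted == 0 and actual == 1:
            fn += 1   # predicted no, actually yes
        else:
            tn += 1   # predicted no, actually no
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
```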

Answer 16

Fraction of correct predictions. Accuracy = correct / total = (tp + tn) / (tp + fp + tn + fn)

Answer 17

How accurate the positive predictions were. Precision = tp / (tp + fp)

Answer 18

What fraction of the actual positives the model identified. Recall = tp / (tp + fn)

Answer 19

The harmonic mean of precision and recall; it lies between them. F1 = 2 * precision * recall / (precision + recall)
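
The formulas on the preceding cards (accuracy, precision, recall, and their harmonic mean) can be checked with a few lines of plain Python. This is a sketch under assumptions: the function name and the example counts are made up, and the zero-denominator cases (no predicted or no actual positives) are not handled.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
accuracy, precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=4, tn=86)
print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the accuracy is high even though a third of the actual positives are missed, which is why precision and recall are reported alongside it.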

Answer 20

Too many “yes” predictions give high fp: high recall, low precision. Too few “yes” predictions give high fn: low recall, high precision.
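
The trade-off can be seen by moving a decision threshold over model scores. The sketch below assumes a classifier that outputs a score per example and predicts “yes” when the score meets the threshold; the scores, labels, and thresholds are made up for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Predict 1 when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.7]

low = precision_recall(y_true, scores, threshold=0.25)   # many "yes" predictions
high = precision_recall(y_true, scores, threshold=0.75)  # few "yes" predictions
print(low, high)
```

The low threshold catches every positive at the cost of false positives; the high threshold makes no false positives but misses half of the positives.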

Answer 21

Suited to yes or no features

Answer 22

Numerical features

Answer 23

Numeric or categorical features

Answer 24

Numerical features

Answer 25

print(df.isnull().values.sum())

Answer 26

print(df.isnull().sum())
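
The two counting expressions on this card and the previous one differ only in scope: one returns a single total, the other a count per column. A small sketch on a made-up DataFrame (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

total_missing = df.isnull().values.sum()   # one number for the whole frame
missing_per_column = df.isnull().sum()     # a Series indexed by column name

print(total_missing)
print(missing_per_column)
```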

Answer 27

print(df['category_col'].value_counts())

Answer 28

```
labels = df['category_col'].astype('category').cat.categories.tolist()
replace = {'category_col': {k: v for k, v in zip(labels, list(range(1, len(labels) + 1)))}}
print(replace)
```
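
A sketch of what such a label-to-integer mapping looks like once it is applied. The sample values and the new `category_code` column are assumptions, and `Series.map` is used here as one straightforward way to apply the mapping; the card only builds and prints the dictionary.

```python
import pandas as pd

df = pd.DataFrame({"category_col": ["red", "green", "blue", "green", "red"]})

# Categories come back sorted, so the codes follow alphabetical order.
labels = df["category_col"].astype("category").cat.categories.tolist()
mapping = {label: code for code, label in enumerate(labels, start=1)}

# Apply the mapping to get an integer-encoded column.
df["category_code"] = df["category_col"].map(mapping)

print(mapping)
print(df)
```

The integers carry an arbitrary order, so this encoding is safest for ordinal data or for models that do not read meaning into the numeric distance between codes.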

Answer 29

Nominal: no intrinsic order.
Ordinal: ordered or ranked.
Dichotomous: nominal with only 2 categories.
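
In pandas the distinction shows up as ordered versus unordered categoricals. A minimal sketch, with made-up example values:

```python
import pandas as pd

# Ordinal: categories carry a rank, so comparisons and min/max are meaningful.
sizes = pd.Series(["small", "large", "medium", "small"])
size_dtype = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordinal = sizes.astype(size_dtype)

# Nominal: categories are just labels with no order.
nominal = pd.Series(["cat", "dog", "bird"]).astype("category")

# Dichotomous: a nominal variable with exactly two categories.
dichotomous = pd.Series(["yes", "no", "yes"]).astype("category")

print(ordinal.max())
print((ordinal < "large").tolist())
print(nominal.cat.ordered, len(dichotomous.cat.categories))
```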

Correct syntax for numpy, pandas, matplotlib and seaborn (54 cards)