Exploratory Data Analysis Flashcards
Correct syntax for numpy, pandas, matplotlib and seaborn (54 cards)
Select columns (.loc) from DataFrame with ALL non-zeros
df.loc[:, df.all()]
Select columns (.loc) from DataFrame with ANY non-zeros
df.loc[:, df.any()]
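Example (a minimal sketch; the toy DataFrame and its column names are made up for illustration). df.all() and df.any() return one boolean per column, which .loc then uses as a column mask:

```python
import pandas as pd

# Hypothetical toy frame: 'a' has no zeros, 'b' has one zero, 'c' is all zeros.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 5, 6], 'c': [0, 0, 0]})

all_nonzero = df.loc[:, df.all()]  # columns where every value is non-zero
any_nonzero = df.loc[:, df.any()]  # columns with at least one non-zero value

print(list(all_nonzero.columns))  # ['a']
print(list(any_nonzero.columns))  # ['a', 'b']
```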
Select columns (.loc) from DataFrame with ANY NaNs
df.loc[:, df.isnull().any()]
Select columns (.loc) from DataFrame with NO NaNs
df.loc[:, df.notnull().all()]
Drop rows with ANY NaNs
df.dropna(how='any')
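Example (a minimal sketch on a made-up two-column frame) combining the three NaN cards: select columns by NaN presence, then drop incomplete rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'x' contains one NaN, 'y' is complete.
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})

with_nans = df.loc[:, df.isnull().any()]       # columns with at least one NaN
without_nans = df.loc[:, df.notnull().all()]   # columns with no NaN at all
complete_rows = df.dropna(how='any')           # rows with no NaN in any column

print(list(with_nans.columns))     # ['x']
print(list(without_nans.columns))  # ['y']
print(list(complete_rows.index))   # [0, 2]
```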
Import local file.xlsx using pandas (as data)
data = pd.ExcelFile('file.xlsx')  # print(data.sheet_names)  # df = data.parse('sheetname') or data.parse(0)
Initial GET requests using urllib.
6 lines: import statements, url, request, response, read and close.
- from urllib.request import urlopen, Request
- url = "https://www.wikipedia.org"
- request = Request(url)
- response = urlopen(request)
- html = response.read()
- response.close()
Initial GET requests using requests.
4 lines: import, url, request, read.
- import requests
- url = "https://www.wikipedia.org"
- r = requests.get(url)
- text = r.text
Tidy Data: Principles (3)
- Columns represent separate variables containing values
- Rows represent individual observations
- Observational units form tables
Tidy Data: Melting and Pivoting
Melting: turn columns into rows (report-friendly wide data into analysis-friendly long data).
Pivoting: turn unique values into separate columns (analysis-friendly data into report-friendly data).
Tidy Data: Melting syntax
pd.melt(frame=df, id_vars='col-2b-fixed', value_vars=[' ', ' '], var_name='name', value_name='name')
Tidy Data: Pivoting syntax
df.pivot_table(values=' ', index=' ', columns=' ', aggfunc=np.mean)
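Example (a minimal sketch; the city/month temperature data is invented for illustration). Melting takes a wide table to long form, and pivot_table takes it back. The string 'mean' is used here in place of np.mean, which recent pandas versions prefer:

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per month.
wide = pd.DataFrame({
    'city': ['Oslo', 'Lima'],
    'jan': [-3.0, 23.0],
    'jul': [17.0, 17.0],
})

# Melt: the 'jan' and 'jul' columns become rows under 'month'/'temp'.
long = pd.melt(frame=wide, id_vars='city', value_vars=['jan', 'jul'],
               var_name='month', value_name='temp')

# Pivot: the unique values of 'month' become separate columns again.
back = long.pivot_table(values='temp', index='city', columns='month',
                        aggfunc='mean')

print(long.shape)                # (4, 3)
print(back.loc['Oslo', 'jan'])   # -3.0
```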
Change column-type from 'object' to 'numeric'
df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')
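Example (a minimal sketch with made-up string values). With errors='coerce', entries that cannot be parsed become NaN instead of raising:

```python
import pandas as pd

# Hypothetical column read in as text, with one unparseable entry.
df = pd.DataFrame({'object_col': ['1', '2.5', 'n/a']})

df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')

print(df['object_col'].dtype)            # float64
print(df['object_col'].isna().tolist())  # [False, False, True]
```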
Change column-type to 'category'
df['column'] = df['column'].astype('category')
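Example (a minimal sketch on an invented colour column). astype returns a new Series, so the result is assigned back to the column:

```python
import pandas as pd

# Hypothetical text column with repeated labels.
df = pd.DataFrame({'column': ['red', 'blue', 'red']})

df['column'] = df['column'].astype('category')

print(df['column'].dtype)                 # category
print(list(df['column'].cat.categories))  # ['blue', 'red']
```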
Plot idioms for DataFrames (3)
df.plot(kind='hist')
plt.hist(df['column'])
df.hist()
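Example (a minimal sketch; the single-column frame and bins=3 are assumptions for illustration, and the Agg backend is chosen only so it runs without a display). The three idioms draw the same histogram:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one numeric column.
df = pd.DataFrame({'column': [1, 2, 2, 3, 3, 3]})

ax = df['column'].plot(kind='hist', bins=3)        # pandas .plot wrapper
plt.close('all')

counts, edges, _ = plt.hist(df['column'], bins=3)  # matplotlib called directly
plt.close('all')

axes = df.hist(bins=3)  # DataFrame.hist: one subplot per numeric column
plt.close('all')

print(list(counts))  # [1.0, 2.0, 3.0]
```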
Syntax for .loc accessor
df.loc['Row_Label', 'Col_Label']
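Example (a minimal sketch; the labels are placeholders taken from the card, plus two invented ones). Passing the row and column labels to .loc in one call works for both reading and assignment:

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {'Col_Label': [10, 20], 'Other': [30, 40]},
    index=['Row_Label', 'Row_2'],
)

value = df.loc['Row_Label', 'Col_Label']  # read a single cell
df.loc['Row_2', 'Other'] = 99             # assign to a single cell

print(value)                      # 10
print(df.loc['Row_2', 'Other'])   # 99
```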
Syntax: sns barplot - use.
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
Syntax: sns countplot - use.
sns.countplot(x='column', data=df, hue='category')
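Example (a minimal sketch; the toy data is invented, and the Agg backend is an assumption so it runs without a display). barplot aggregates a numerical column per category, while countplot only counts rows per category:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two categories, with a second grouping column for hue.
df = pd.DataFrame({
    'categorical': ['a', 'a', 'b', 'b'],
    'numerical': [1.0, 3.0, 10.0, 20.0],
    'category': ['u', 'v', 'u', 'v'],
})

fig1, ax1 = plt.subplots()
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean, ax=ax1)

fig2, ax2 = plt.subplots()
sns.countplot(x='categorical', data=df, hue='category', ax=ax2)

plt.close('all')
```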
Syntax: sns histogram plot - use.
sns.histplot(df['continuous'], kde=False, bins=30)  # sns.distplot is deprecated since seaborn 0.11
Syntax: sns scatterplot - use.
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
Syntax: sns pairplot - use.
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns categorical plots
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.catplot(x='categorical', y='numerical', data=df, kind='bar')  # kind can also be 'point' or 'violin'; catplot replaced factorplot in seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)
Syntax: sns heatmap - use.
sns.heatmap(PivotTable, cmap=' ', linewidths=' ', linecolor=' ')  # linewidths takes a number, linecolor a color name