Exploratory Data Analysis Flashcards

Correct syntax for numpy, pandas, matplotlib and seaborn (54 cards)

1
Q

Select columns (.loc) from DataFrame with ALL non-zeros

A

df.loc[ : , df.all( ) ]

2
Q

Select columns (.loc) from DataFrame with ANY non-zeros

A

df.loc[ : , df.any( ) ]

3
Q

Select columns (.loc) from DataFrame with ANY NaNs

A

df.loc[:, df.isnull().any()]

4
Q

Select columns (.loc) from DataFrame with NO NaNs

A

df.loc[:, df.notnull().all()]

5
Q

Drop rows with ANY NaNs

A

df.dropna(how='any')
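
Cards 1 to 5 can be tried side by side on a tiny frame. The sketch below is only illustrative: the column names a, b, c and their values are made up, not taken from the cards.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so that each selection gives a different result
df = pd.DataFrame({
    "a": [1, 2, 3],        # all non-zero, no NaN
    "b": [0, 0, 0],        # all zero, no NaN
    "c": [0, 5, np.nan],   # one zero, one non-zero, one NaN
})

all_nonzero = df.loc[:, df.all()]            # columns where every value is truthy
any_nonzero = df.loc[:, df.any()]            # columns with at least one truthy value
any_nan = df.loc[:, df.isnull().any()]       # columns containing a NaN
no_nan = df.loc[:, df.notnull().all()]       # columns without any NaN
complete_rows = df.dropna(how="any")         # rows without any NaN

print(list(all_nonzero.columns), list(any_nonzero.columns))
print(list(any_nan.columns), list(no_nan.columns))
print(complete_rows.index.tolist())
```

Note that df.all() and df.any() skip NaN by default, so column c is judged on its 0 and 5 only.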

6
Q

Import local file.xlsx using pandas (as data)

A
data = pd.ExcelFile('file.xlsx')
# print(data.sheet_names)
# df = data.parse('sheetname')  # or data.parse(0) for the first sheet
7
Q

Initial GET requests using urllib.

6 lines : import statements, url, request, response, read and close.

A
  1. from urllib.request import urlopen, Request
  2. url = "https://www.wikipedia.org"
  3. request = Request(url)
  4. response = urlopen(request)
  5. html = response.read( )
  6. response.close( )
8
Q

Initial GET requests using requests.

4 lines : import, url, request, read.

A
  1. import requests
  2. url = "https://www.wikipedia.org"
  3. r = requests.get(url)
  4. text = r.text
9
Q

Tidy Data: Principles. ( 3 )

A
  1. Columns represent separate variables containing values
  2. Rows represent individual observations
  3. Observational units form tables
10
Q

Tidy Data: Melting and Pivoting

A

Melting: turn columns into rows (report-friendly wide data into analysis-friendly long data).
Pivoting: turn unique values into separate columns (analysis-friendly long data into report-friendly wide data).

11
Q

Tidy Data: Melting syntax

A

pd.melt(frame=df, id_vars='col-2b-fixed', value_vars=[' ', ' '], var_name='name', value_name='name')

12
Q

Tidy Data: Pivoting syntax

A

df.pivot_table(values=' ', index=' ', columns=' ', aggfunc=np.mean)
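
A small round trip may make the melt and pivot cards more concrete. This is a sketch on invented data (the names wide, long_df, treatment_a and so on are assumptions); it passes aggfunc as the string "mean", which pandas accepts in place of np.mean.

```python
import pandas as pd

# Hypothetical wide table: one row per person, one column per treatment
wide = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "treatment_a": [10, 20],
    "treatment_b": [30, 40],
})

# Melt: the two treatment columns become rows
long_df = pd.melt(
    frame=wide,
    id_vars="name",
    value_vars=["treatment_a", "treatment_b"],
    var_name="treatment",
    value_name="result",
)

# Pivot: unique values of "treatment" become columns again
back = long_df.pivot_table(
    values="result", index="name", columns="treatment", aggfunc="mean"
)

print(long_df)
print(back)
```

pivot_table aggregates duplicates with aggfunc, so it also works when an index/column pair occurs more than once; here each pair is unique and the mean is just the original value.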

13
Q

Change column-type from ‘object’ to ‘numeric’

A

df['object_col'] = pd.to_numeric(df['object_col'], errors='coerce')

14
Q

Change column-type to ‘category’

A

df['column'] = df['column'].astype('category')
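
Both type conversions can be checked on a toy frame. The sketch below uses invented columns (price, grade); the point is that errors='coerce' turns unparseable strings into NaN and that astype returns a new Series, so the result has to be assigned back.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus a repeated label column
df = pd.DataFrame({
    "price": ["1.5", "2", "n/a"],
    "grade": ["a", "b", "a"],
})

# Unparseable entries ("n/a") become NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# astype does not modify in place, so assign the result back
df["grade"] = df["grade"].astype("category")

print(df.dtypes)
print(df["grade"].cat.categories.tolist())
```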

15
Q

Plot idioms for DataFrames (3)

A

df.plot(kind='hist')
df.plot.hist()
df.hist()

16
Q

Syntax for .loc accessor

A

df.loc['Row_Label', 'Col_Label']
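
As a quick illustration of the .loc accessor, the sketch below uses a made-up frame with row labels r1, r2 and columns a, b (all assumptions). Passing row and column in a single .loc call avoids chained indexing.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

cell = df.loc["r2", "b"]          # single value by row and column label
row = df.loc["r1"]                # whole row as a Series
sub = df.loc["r1":"r2", ["a"]]    # label slices include both endpoints

print(cell)
print(row.to_dict())
print(sub.shape)
```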

17
Q

Syntax: sns barplot - use.

A

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)

18
Q

Syntax: sns countplot - use.

A

sns.countplot(x='column', data=df, hue='category')

19
Q

Syntax: sns histogram plot - use.

A

sns.distplot(df['continuous'], kde=False, bins=30)  # deprecated since seaborn 0.11; use sns.histplot(df['continuous'], bins=30)

20
Q

Syntax: sns scatterplot - use.

A

sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)

21
Q

Syntax: sns pairplot - use.

A

sns.pairplot(df, hue='categorical', palette='coolwarm')

22
Q

sns categorical plots

A

sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.catplot(x='categorical', y='numerical', data=df, kind='bar')  # or kind='point', kind='violin'; named factorplot before seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)

23
Q

Syntax: sns heatmap - use.

A

sns.heatmap(PivotTable, cmap=' ', linewidths=1, linecolor=' ')

25
sns categorical plots (6)
sns.barplot(x='categorical', y='numerical', data=df, estimator=np.mean)
sns.boxplot(x='categorical', y='numerical', data=df, hue='categorical')
sns.countplot(x='categorical', data=df)
sns.catplot(x='categorical', y='numerical', data=df, kind='bar')  # or kind='point', kind='violin'; named factorplot before seaborn 0.9
sns.stripplot(x='categorical', y='numerical', data=df, hue='categorical', jitter=True, dodge=True)
sns.violinplot(x='categorical', y='numerical', data=df, hue='categorical', split=True)
26
sns distribution plots (4)
sns.distplot(df['continuous'], kde=False, bins=30)  # deprecated since seaborn 0.11; use sns.histplot
sns.jointplot(x='continuous', y='numerical', data=df)
sns.pairplot(df, hue='categorical', palette='coolwarm')
sns.rugplot(df['continuous'])
27
sns matrix plots (2)
sns.heatmap(PivotTable, cmap=' ', linewidths=1, linecolor=' ')
sns.heatmap(df.corr(), annot=True)
sns.clustermap(PivotTable, cmap=' ')
28
sns PairGrid
g = sns.PairGrid(df)
g.map(plt.scatter)
g.map_diag(sns.distplot)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)
29
sns FacetGrid
g = sns.FacetGrid(data=df, col='category', row='category')
g.map(sns.distplot, 'numerical')
30
sns lmplot (regression)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', markers=['o', 'v'], scatter_kws={'s': 100})
31
Scatter plots
df.plot.scatter(x='col_A', y='col_B', c='col_C', s=df['col_C']*100)
plt.scatter(x, y)
sns.lmplot(x='numerical', y='numerical', data=df, hue='category', fit_reg=False)
df.iplot(kind='scatter', x='col_A', y='col_B', mode='markers')  # requires cufflinks
32
Data Quality Control - find missing data - drop columns - drop rows
df.isnull()  # returns a bool DataFrame, True where NaN
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
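
The card title also mentions dropping columns and rows. One possible sequence is sketched below on invented data; the column names (x, mostly_missing, y) and the decision to drop a column that is mostly NaN are assumptions for illustration, not part of the card.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one sparse column and one column with a single gap
df = pd.DataFrame({
    "x": [1, 2, 3],
    "mostly_missing": [np.nan, np.nan, 1.0],
    "y": [1.0, np.nan, 3.0],
})

# Find missing data: share of NaN per column
missing_share = df.isnull().mean()

# Drop columns: remove the column that is mostly empty
trimmed = df.drop(columns=["mostly_missing"])

# Drop rows: remove any remaining row that still contains a NaN
clean = trimmed.dropna()

print(missing_share)
print(clean)
```

Dropping the sparse column first keeps more rows than calling dropna() on the full frame would.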
33
Histogram plots
df['continuous'].hist(bins=20)
df[df['column1'] == 1]['column2'].hist(bins=30, color='blue', label='label', alpha=0.5)
34
Reproducible Data Analysis 1/10 Get data from web Jake Vanderplas
```
import pandas as pd
from urllib.request import urlretrieve

URL = 'https://data.seattle.gov/api....'
urlretrieve(URL, 'file.csv')
data = pd.read_csv('file.csv', index_col='Date', parse_dates=True)
data.resample('W').sum().plot()
```
35
Reproducible Data Analysis 2/10 Exploratory Data Analysis Jake Vanderplas
data.columns = ['West', 'East']
data['Total'] = data['West'] + data['East']
ax = data.resample('D').sum().rolling(365).sum().plot()
ax.set_ylim(0, None)
data.groupby(data.index.time).mean().plot()
pivoted = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
pivoted.plot(legend=False, alpha=0.01)  # line for every day
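
The resample, groupby and pivot steps can be tried without downloading anything. The sketch below replaces the real counts with a synthetic two-day hourly series (constant values 1 and 2, an assumption made purely so the results are easy to check) and leaves out the plotting calls.

```python
import pandas as pd

# Synthetic stand-in for the downloaded data: 48 hourly rows over two days
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
data = pd.DataFrame({"West": 1, "East": 2}, index=index)
data["Total"] = data["West"] + data["East"]

# Daily totals: 24 hours * 3 per hour
daily = data.resample("D").sum()

# Average by time of day: one row per distinct clock time
by_time = data.groupby(data.index.time).mean()

# One column per date, one row per time of day
pivoted = data.pivot_table("Total", index=data.index.time, columns=data.index.date)

print(daily["Total"].tolist())
print(by_time.shape, pivoted.shape)
```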
36
Reproducible Data Analysis 3/10 Version control with Git & GitHub Jake Vanderplas
```
https://github.com
Create new repository (Name, description, README, .gitignore (Python), MIT License)
Copy "Clone or download" link
Terminal window:
git clone (paste copied link)
mv JupyterNotebook.ipynb into git folder
git status
git add JupyterNotebook.ipynb
git commit -m "Add initial analysis"
git push origin master
```
Open JupyterNotebook.ipynb from the correct location.
git status now lists file.csv as untracked.
vim .gitignore, then add two lines: "# data" and "file.csv".
37
Reproducible Data Analysis 4/10 Working with Data and GitHub Jake Vanderplas
import os
from urllib.request import urlretrieve
import pandas as pd

URL = 'https://data.seattle.gov/api....'

def get_file(filename='file.csv', url=URL, force_download=False):
    # download only when forced or when no cached copy exists
    if force_download or not os.path.exists(filename):
        urlretrieve(url, filename)
    data = pd.read_csv(filename, index_col='Date', parse_dates=True)
    data.columns = ['West', 'East']
    data['Total'] = data['West'] + data['East']
    return data

data = get_file()
38
Reproducible Data Analysis 5/10 Creating a Python package Jake Vanderplas
Terminal window:
mkdir jupyterworkflow
touch jupyterworkflow/__init__.py
vim jupyterworkflow/data.py
```
"""
Download and cache the data

Parameters:
filename : string (optional)
    location to save the data
url : string (optional)
    web location
force_download : bool (optional)

Returns
"""

< replace 4/10 with following >
from jupyterworkflow.data import get_file
```
39
Confusion matrix
True pos (tp). False pos (fp). False neg (fn ). True neg (tn).
40
Confusion matrix Accuracy =
Fraction of correct predictions. Accuracy = correct / total = (tp + tn) / (tp + fp + tn + fn)
41
Confusion matrix Precision =
How accurate the positive predictions were. Precision = tp / (tp + fp)
42
Confusion matrix Recall
What fraction of the actual positives the model identified. Recall = tp / (tp + fn)
43
Confusion matrix F1 score
The harmonic mean of precision and recall; it lies between them. F1 = 2 * prec * recall / (prec + recall)
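
The four confusion-matrix formulas can be checked with a few lines of plain Python. The counts below are hypothetical, picked only so the arithmetic is easy to follow, and the function name confusion_metrics is an illustrative choice.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for a rare positive class
accuracy, precision, recall, f1 = confusion_metrics(tp=8, fp=2, fn=4, tn=86)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts accuracy is 0.94 even though a third of the positives are missed, which is why precision, recall and F1 are reported alongside it. The sketch does not guard against zero denominators.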
44
Model trade-off between precision and recall.
Too many "yes" predictions give high fp: high recall, low precision.
Too few "yes" predictions give high fn: low recall, high precision.
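
The trade-off shows up when the same scores are cut at two different thresholds. This is a sketch with invented scores and labels; the helper precision_recall and the thresholds 0.2 and 0.7 are assumptions for illustration.

```python
def precision_recall(labels, scores, threshold):
    """Predict "yes" when score >= threshold and return (precision, recall)."""
    predictions = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y == 1)
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical model scores and true labels (1 = positive)
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

# Low threshold: many "yes" predictions
loose_precision, loose_recall = precision_recall(labels, scores, threshold=0.2)

# High threshold: few "yes" predictions
strict_precision, strict_recall = precision_recall(labels, scores, threshold=0.7)

print(loose_precision, loose_recall)
print(strict_precision, strict_recall)
```

The loose threshold catches every positive at the cost of two false positives; the strict one makes no false positives but misses one positive.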
45
Input feature categories Naive Bayes classifier
Suited to yes or no features
46
Input feature categories Regression models
Numerical features
47
Input feature categories Decision tree
Numeric or categorical features
48
Input feature categories SVM
Numerical features
49
Common way to analyze the relationship between a categorical feature and a continuous feature
Boxplot
50
Check for null values in the dataset.
print(df.isnull().values.sum())
51
Check column-wise distribution of null values
print(df.isnull().sum())
52
Frequency distribution of categories within a feature
print(df['category_col'].value_counts())
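
The three checks from cards 50 to 52 are shown together below on a made-up frame; the column names carrier and delay and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({
    "carrier": ["AA", "DL", "AA", None],
    "delay": [5.0, np.nan, 3.0, 1.0],
})

total_nulls = df.isnull().values.sum()          # single number for the whole frame
nulls_per_column = df.isnull().sum()            # one count per column
carrier_counts = df["carrier"].value_counts()   # NaN is excluded by default

print(total_nulls)
print(nulls_per_column.to_dict())
print(carrier_counts.to_dict())
```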
53
Dictionary comprehension to map category strings to numeric values. eg. {'carrier': {'AA': 1, 'OO': 7, 'DL': 4, 'F9': 5, 'B6': 3, 'US': 9, 'AS': 2, 'WN': 11, 'VX': 10, 'HA': 6, 'UA': 8}}
```
labels = df['category_col'].astype('category').cat.categories.tolist()
replace = {'category_col': {k: v for k, v in zip(labels, list(range(1, len(labels) + 1)))}}
print(replace)
```
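
The card stops at building the dictionary. One way to apply it is sketched below on invented data; using Series.map for the final step is an assumption about how the mapping would be used, not something the card states, and enumerate(labels, start=1) is used as an equivalent of the zip/range construction.

```python
import pandas as pd

# Hypothetical category column
df = pd.DataFrame({"category_col": ["AA", "OO", "DL", "AA"]})

# Categories come back sorted, so codes follow alphabetical order
labels = df["category_col"].astype("category").cat.categories.tolist()
replace = {"category_col": {k: v for v, k in enumerate(labels, start=1)}}

# Apply the inner mapping to the column (assumed usage)
df["category_code"] = df["category_col"].map(replace["category_col"])

print(replace)
print(df["category_code"].tolist())
```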
54
Categorical Data - 3 types
Nominal: no intrinsic order.
Ordinal: ordered or ranked.
Dichotomous: nominal with only 2 categories.