Python & Plots Flashcards

(58 cards)

1
Q

Package

A

A collection of modules

2
Q

Library

A

A collection of packages

3
Q

Module

A

A bunch of related code saved in a file
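Python itself ships examples of this — a minimal sketch using the built-in math module:

```python
# "math" is a module: a single file of related code shipped with Python.
import math

# Using functions and constants defined in that file:
print(math.sqrt(16))
print(math.pi)
```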

4
Q

Framework

A

A collection of modules and packages that contain the basic flow and architecture of an application 

5
Q

Pandas

A

Open-source Python package used to manipulate and analyse tabular data. Built on NumPy.

6
Q

Scatter plots

A

Great for viewing unordered data points

inflation_unemploy.plot(kind='scatter', x='unemployment_rate', y='cpi')

sns.scatterplot(x = "age", y = "value", size = "mpg", data = valuation)

sns.jointplot(x = 'age', y = 'value', data = valuation)

7
Q

Line plots

A

Great for viewing ordered data points

dow_bond.plot(kind='line', x='date',y=['close_dow', 'close_bond'], rot=90)
8
Q

Bar Charts

A

Great for viewing categorical data

  • Bar plots don't work on a log scale because bars must start at 0, and the log of 0 is undefined.

Horizontal Bar Plots

df.plot.barh(x='val', y='lab')

OR

sns.barplot(x="val", y="lab", data=df)

9
Q

Histogram plot

A

Great for visualising the distribution of values in a data set.

The data is chunked into bins, and each value falls into one of the bins.

dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist()

To draw multiple histograms:
~~~
dogs[["height_cm", "weight_kg"]].hist()
~~~

10
Q

Series

A

A one-dimensional array; more than one Series makes a DataFrame.
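A minimal sketch (the column names are invented for illustration):

```python
import pandas as pd

# A Series is a labelled, one-dimensional array...
heights = pd.Series([1.73, 1.68, 1.71], name="height_m")
weights = pd.Series([65.4, 59.2, 63.6], name="weight_kg")

# ...and several Series side by side form a DataFrame.
df = pd.DataFrame({"height_m": heights, "weight_kg": weights})
print(df.shape)  # three rows, two columns
```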

11
Q

Pandas LOC

A

df.loc["row_label"]

df.loc["row_label", "col_label"]

A single bracket gives you a Series and a double bracket gives you a DataFrame.

12
Q

Pandas iloc

A

df.iloc[[1]]
Used for integer-location based indexing.

print(df.iloc[:, 1:])

13
Q

Box plot

A

Used to compare the distribution of continuous variables for each category 

  • Answers questions about the spread of variables.
  • In a box plot, sorting by the IQR makes it easier to answer questions about how much variation there was among the “typical” population.
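A minimal sketch with seaborn, assuming a DataFrame with a categorical `day` column and a numeric `bill` column (both invented here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one continuous variable per category
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "bill": [10.0, 14.0, 12.0, 20.0, 18.0, 30.0],
})

# One box per category shows the median, IQR, and whiskers
sns.boxplot(x="day", y="bill", data=df)
plt.show()
```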
14
Q

Numpy Comparisons

A
* logical_and()
* logical_or()
* logical_not()

np.logical_and(bmi > 21, bmi < 22)

15
Q

Enumerate a list

A
fam = [1.73, 1.68, 1.71, 1.89]
for index, height in enumerate(fam):
    print("index " + str(index) + ": " + str(height))
16
Q

Looping in dictionaries

A
world = {"afghanistan": 30.55,
         "albania": 2.77,
         "algeria": 39.21}

for key, value in world.items():
    print(key + " --- " + str(value))
17
Q

Looping 2d arrays

A
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])

for val in np.nditer(meas):
    print(val)
18
Q

Looping pandas df

A
import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

for lab, row in brics.iterrows():
    print(lab)
    print(row)
19
Q

Pandas apply

A
  • Can be used to add a new column and apply some logic to it; it's more efficient than a loop.

import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

brics["name_length"] = brics["country"].apply(len)
print(brics)
20
Q

In pandas, what do the following methods and attributes do?

  • .head()
  • .info()
  • .shape
  • .describe()
A
  • .head() returns the first few rows (the “head” of the DataFrame).
  • .info() shows information on each of the columns, such as the data type and number of missing values.
  • .shape returns the number of rows and columns of the DataFrame.
  • .describe() calculates a few summary statistics for each column.
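A quick sketch on a throwaway DataFrame (the data is invented here):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.head(2))      # first 2 rows
df.info()              # dtype and non-null count per column
print(df.shape)        # (rows, columns) tuple
print(df.describe())   # count, mean, std, min, quartiles, max
```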
21
Q

In pandas, what do the following attributes do?

  • .values
  • .columns
  • .index
A
  • .values: A two-dimensional NumPy array of values.
  • .columns: An index of columns: the column names.
  • .index: An index for the rows: either row numbers or row names.
22
Q

How do you drop duplicates in pandas?

A
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
23
Q

How do you count values in a column in pandas?

A
unique_dogs["breed"].value_counts()
unique_dogs["breed"].value_counts(sort=True)
s.value_counts(normalize=True)  # returns proportion of total
s.value_counts(normalize=True).sort_index()

normalize=True transforms the result into proportions of the total rather than raw counts.

24
Q

How do you sort values in a column in pandas?

A
df.sort_values("breed")
df.sort_values(["breed", "weight_kg"])

result = df.sort_values('salary', ascending=True)

Full signature:
DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
25
Q

How do you filter rows in pandas?

A

test = test[test['state'].isin(canu)]
26
Q

How do you use the agg function in pandas?

A

```
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)
dogs["weight_kg"].agg([pct30, pct40])
dogs[["weight_kg", "height_cm"]].agg(pct30)
```

Group the results by title, then count the number of accounts:

```
counted_df = licenses_owners.groupby('title').agg({'account': 'count'})
```
27
Q

Subset + aggregation in pandas

A

```
sales_C = sales[sales['type'] == "C"]['weekly_sales'].sum()
```

Subset without aggregation (returns only the price column for matching types):

```
avocados[avocados["type"] == "conventional"]["avg_price"]
```

Count the number of missing values in the budget column:

```
number_of_missing_fin = movies_financials['budget'].isnull().sum()
```
28
Q

How do you group by in pandas?

A

avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
29
Q

How do you detect missing values in a df?

A

```
df.isna().any()
```

Plotting missing values:

```
import matplotlib.pyplot as plt
dogs.isna().sum().plot(kind="bar")
plt.show()
```

Dealing with missing values:

```
dogs.dropna()
dogs.fillna(0)
```
30
Q

Create a df by using a list of dictionaries - row by row

A

```
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2019-05-09"}
]

new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
```
31
Q

Create a df by using a dictionary of lists - by column

A

```
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2019-05-09"]
}

new_dogs = pd.DataFrame(dict_of_lists)
```
32
Q

Join/Merging in Pandas

A

```
wards_census = wards.merge(census, on='ward', suffixes=('_ward', '_cen'))
```

Multiple tables:

```
grants_licenses_ward = grants.merge(licenses, on=['address', 'zip']) \
    .merge(wards, on='ward', suffixes=('_bus', '_ward'))
grants_licenses_ward.head()
```

Merge with left join:

```
movies_taglines = movies.merge(taglines, on='id', how='left')
print(movies_taglines.head())
```

Different columns:

```
movies_and_scifi_only = movies.merge(scifi_only, how='inner',
                                     left_on='id', right_on='movie_id')
```

An indicator column to know the source table and commonalities (left_only, both, right_only):

```
genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
```
33
Q

Concat Dataframes

A

```
pd.concat([dfA, dfB], ignore_index=True)
```

Concat tables with different column names:

```
pd.concat([inv_jan, inv_feb], sort=True)
```

OR

```
pd.concat([inv_jan, inv_feb], join='inner')
```
34
Q

Appending Tables

A

inv_jan.append([inv_feb, inv_mar], ignore_index=True, sort=True)
35
Q

Validating merges in pandas

A

Validating merges: .merge(validate=None)

`one_to_one`, `many_to_one`, `many_to_many`

Verifying concatenations: .concat(verify_integrity=False)

* Checks whether the new concatenated **index** contains duplicates
* Default value is False
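A sketch of validate in action with invented tables — the merge raises MergeError when the stated relationship is violated:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1, 2], "y": [10, 11, 20]})

# id 1 matches two rows on the right, so one_to_one is violated...
try:
    left.merge(right, on="id", validate="one_to_one")
except pd.errors.MergeError as e:
    print("validation failed:", e)

# ...but one_to_many describes this relationship correctly.
ok = left.merge(right, on="id", validate="one_to_many")
print(ok)
```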
36
Q

How to check if a column in dfA has the same values as a column in dfB - Filtering

A

popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]
37
Q

merge_ordered()

A

* Great for time series data
* Column(s) to join: on, left_on, and right_on
* Type of join: how (left, right, inner, outer); default is outer
* Overlapping column names: suffixes
* Calling the function: pd.merge_ordered(df1, df2)
* fill_method='ffill' (forward fill) fills missing data from the previous row (above it)
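A sketch with invented monthly data, showing fill_method='ffill' carrying the last known gdp value forward:

```python
import pandas as pd

gdp = pd.DataFrame({"date": ["2020-01", "2020-03"], "gdp": [1.0, 1.2]})
cpi = pd.DataFrame({"date": ["2020-02", "2020-03"], "cpi": [2.0, 2.1]})

# Outer join by default, rows kept in date order;
# ffill fills a missing value from the row above it.
out = pd.merge_ordered(gdp, cpi, on="date", fill_method="ffill")
print(out)
```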
38
Q

Using merge_asof()

A

* Similar to a merge_ordered() left join
* Matches on the nearest **key column and not exact matches.**
* Merged **"on" columns must be sorted.**

```
pd.merge_asof(visa, ibm, on=['date_time'],
              suffixes=('_visa', '_ibm'),
              direction='forward')
```

The default direction is backward, but you can also choose *forward* or *nearest*.
39
Q

Query method in pandas

A

* .query('SOME SELECTION STATEMENT')
* Accepts an input string
* Input string used to determine what rows are returned
* Input string similar to the statement after a WHERE clause in SQL

`stocks.query('nike >= 90')`

`stocks_long.query('stock=="disney" or (stock=="nike" and close < 90)')`

Accessing a date column that's an index:

`recent_gdp_pop = gdp_pivot.query('date <= "1991-01-01"')`
40
Q

Pivot

A

```
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
```

Filling missing values in pivot tables:

```
dogs.pivot_table(values="weight_kg", index="color",
                 columns="breed", fill_value=0)
```
41
Q

Melt

A

* To make analysis of data in a table easier, we can reshape the data into a more computer-friendly form
* pandas.melt() unpivots a DataFrame from wide format to long format.

`social_fin_tall = social_fin.melt(id_vars=['financial', 'company'])`

You can melt only certain values in a column. Here it's only values in the financial column equal to 2018 & 2017:

```
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'],
                                  value_vars=['2018', '2017'],
                                  var_name='year',
                                  value_name='dollars')
```

Think of id_vars as the columns you want to keep the same; the rest will be organised into proper rows.
42
Q

Uniform distribution

A

```
from scipy.stats import uniform
uniform.cdf(7, 0, 12)
```

It's used with continuous distributions where we need to find the area under the curve for that distribution.
43
Q

Seaborn plot

A

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", hue_order=["Yes", "No"])
plt.show()
```

OR

```
import matplotlib.pyplot as plt
import seaborn as sns
hue_colors = {"Yes": "black", "No": "red"}
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", palette=hue_colors, size='size')
plt.show()
```

Relationship plot and a few extra variables that it takes:

```
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter",
            col="day", col_wrap=2,
            col_order=["Thur", "Fri", "Sat", "Sun"])
plt.show()
```

Different point style and transparency:

```
# Set alpha to be between 0 and 1
sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter", alpha=0.4)
```
44
Q

Seaborn line plots

A

Subgroups by location:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2_mean", data=air_df_loc_mean,
            kind="line", style="location", hue="location")
plt.show()
```

Multiple observations per x-value; this automatically plots a confidence interval:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line")
plt.show()
```

Plotting standard deviation:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line", ci="sd")
plt.show()
```
45
Q

Seaborn categorical plots

A

Count plot:

```
import matplotlib.pyplot as plt
import seaborn as sns
category_order = ["No answer", "Not at all", "Not very",
                  "Somewhat", "Very"]
sns.catplot(x="how_masculine", data=masculinity_data,
            kind="count", order=category_order)
plt.show()
```

Bar plots show the mean of a quantitative variable per category.

Box plots: sym changes the appearance of outliers:

```
g = sns.catplot(x="time", y="total_bill", data=tips, kind="box", sym="")
```

To change whiskers (default is 1.5 * IQR): whis=2, or whis=[5, 95]
46
Q

Seaborn point plots

A

* Points show the mean of a quantitative variable
* A line plot has a quantitative variable (usually time) on the x-axis
* A point plot has a categorical variable on the x-axis

```
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.catplot(x="age", y="masculinity_important",
            data=masculinity_data,
            hue="feel_masculine", kind="point",
            join=False,
            estimator=np.median,  # more robust to outliers
            capsize=0.2)
plt.show()
```
47
Q

Seaborn Styles and colors

A

`sns.set_palette("RdBu")`
`sns.set_style('whitegrid')`

Changing the scale:

* Figure "context" changes the scale of the plot elements and labels
* sns.set_context()
* Smallest to largest: "paper", "notebook", "talk", "poster"

FacetGrid vs. AxesSubplot objects: Seaborn plots create two different types of objects, FacetGrid and AxesSubplot.

```
g = sns.scatterplot(x="height", y="weight", data=df)
type(g)
```

For a FacetGrid: g.fig.suptitle('title')
For an AxesSubplot: g.set_title('title')

Titles for subplots:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data,
                kind="box", col="Group")
g.fig.suptitle("New Title", y=1.03)
g.set_titles("This is {col_name}")
```

Adding axis labels - works for both:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
g.set(xlabel="New X Label", ylabel="New Y Label")
plt.show()
```

Rotating x-axis tick labels:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
plt.xticks(rotation=90)
plt.show()
```
48
Q

Pandas Series string methods

A

```
jobs['roles'] = jobs['roles'].str.lower()
```

```
print(contact.email.str.split('@', expand=True))
```

```
print(s.str.startswith('re'))
```
49
Q

Pandas Duplicated

A

Returns a boolean Series denoting duplicate rows.

`df.duplicated()`
50
Q

Jittered scatter plots

A

age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss))
weight = brfss['WTKG3']
plt.plot(age, weight, 'o', markersize=5, alpha=0.2)
plt.show()

If it's a small sample of data, use a larger marker size.
51
Q

Violin Plot

A

data = brfss.dropna(subset=['AGE', 'WTKG3'])
sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)
plt.show()
52
Q

Bootstrapping

A

Bootstrapping is resampling with replacement; all bootstrapped samples are the same size, and a statistic is applied to each one.

Bootstrapping the coffee mean flavor:

```
import numpy as np

mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
```

Bootstrapping does not account for biases in the data.

Bootstrapping is great for estimating standard deviations rather than the mean.
53
Q

Convert series string to lowercase

A

print(data.str.lower())
54
Q

Random sample

A

print(chess.sample(n=5, random_state=42))
55
Q

Correlation and plots

A

Pairplots:

sns.pairplot(data=divorce)
plt.show()

sns.pairplot(data=divorce,
             vars=["income_man", "income_woman", "marriage_duration"])
plt.show()

Correlation heatmap:

sns.heatmap(planes.corr(), annot=True)
plt.show()
56
Q

Kernel Density Estimate (KDE)

A

sns.kdeplot(data=divorce, x="marriage_duration",
            hue="education_man", cut=0)
plt.show()

sns.kdeplot(data=divorce, x="marriage_duration",
            hue="education_man", cut=0, cumulative=True)
plt.show()
57
Q

Pandas class imbalance

A

value_counts, OR aggregated values with pd.crosstab():

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")

OR

pd.crosstab(planes["Source"], planes["Destination"])
58
Q

pd.cut

A

Labels and bins:

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels, bins=bins)