Python & Plots Flashcards

(58 cards)

1
Q

Package

A

A collection of modules

2
Q

Library

A

A collection of packages

3
Q

Module

A

A bunch of related code saved in a file
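Python itself ships examples of this — a minimal sketch using the built-in math module:

```python
# "math" is a module: a single file of related code shipped with Python.
import math

# Using functions and constants defined in that file:
print(math.sqrt(16))
print(math.pi)
```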

4
Q

Framework

A

A collection of modules and packages that contain the basic flow and architecture of an application 

5
Q

Pandas

A

Open-source Python package used to manipulate and analyse tabular data. Built on NumPy.

6
Q

Scatter plots

A

Great for viewing unordered data points

inflation_unemploy.plot(kind='scatter', x='unemployment_rate', y='cpi')

sns.scatterplot(x = "age", y = "value", size = "mpg", data = valuation)

sns.jointplot(x = 'age', y = 'value', data = valuation)

7
Q

Line plots

A

Great for viewing ordered data points

dow_bond.plot(kind='line', x='date',y=['close_dow', 'close_bond'], rot=90)
8
Q

Bar Charts

A

Great for viewing categorical data

  • Bar plots don't work on a log scale because bars must start at 0, and the log of 0 is undefined.

Horizontal Bar Plots

df.plot.barh(x='val', y='lab')

OR

sns.barplot(x="val", y="lab", data=df)

9
Q

Histogram plot

A

Great for visualising the distribution of values in a data set.

The data is chunked into bins, and each value falls into one of the bins.

dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist()

To draw multiple histograms:
~~~
dogs[["height_cm", "weight_kg"]].hist()
~~~

10
Q

Series

A

A one-dimensional array; more than one Series makes a DataFrame.
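A minimal sketch (the column names are invented for illustration):

```python
import pandas as pd

# A Series is a labelled, one-dimensional array...
heights = pd.Series([1.73, 1.68, 1.71], name="height_m")
weights = pd.Series([65.4, 59.2, 63.6], name="weight_kg")

# ...and several Series side by side form a DataFrame.
df = pd.DataFrame({"height_m": heights, "weight_kg": weights})
print(df.shape)  # three rows, two columns
```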

11
Q

Pandas LOC

A

df.loc["row_label"]

df.loc["row_label", "col_label"]

A single bracket gives you a Series and a double bracket gives you a DataFrame.

12
Q

Pandas iloc

A

df.iloc[[1]]
Used for integer-location based indexing.

print(df.iloc[:, 1:])

13
Q

Box plot

A

Used to compare the distribution of continuous variables for each category 

  • Answers questions about the spread of variables.
  • In a box plot, sorting by the IQR makes it easier to answer questions about how much variation there was among the “typical” population.
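A minimal sketch with seaborn, assuming a DataFrame with a categorical `day` column and a numeric `bill` column (both invented here):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one continuous variable per category
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "bill": [10.0, 14.0, 12.0, 20.0, 18.0, 30.0],
})

# One box per category shows the median, IQR, and whiskers
sns.boxplot(x="day", y="bill", data=df)
plt.show()
```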
14
Q

Numpy Comparisons

A
* logical_and()
* logical_or()
* logical_not()

np.logical_and(bmi > 21, bmi < 22)

15
Q

Enumerate a list

A
fam = [1.73, 1.68, 1.71, 1.89]
for index, height in enumerate(fam):
    print("index " + str(index) + ": " + str(height))
16
Q

Looping in dictionaries

A
world = {"afghanistan": 30.55,
         "albania": 2.77,
         "algeria": 39.21}

for key, value in world.items():
    print(key + " --- " + str(value))
17
Q

Looping 2d arrays

A
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])

for val in np.nditer(meas):
    print(val)
18
Q

Looping pandas df

A
import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

for lab, row in brics.iterrows():
    print(lab)
    print(row)
19
Q

Pandas apply

A
  • Can be used to add a new column and apply some logic to it; it's more efficient than a loop.

import pandas as pd
brics = pd.read_csv("brics.csv", index_col=0)

brics["name_length"] = brics["country"].apply(len)
print(brics)
20
Q

In pandas, what do the following methods and attributes do?

  • .head()
  • .info()
  • .shape
  • .describe()
A
  • .head() returns the first few rows (the “head” of the DataFrame).
  • .info() shows information on each of the columns, such as the data type and number of missing values.
  • .shape returns the number of rows and columns of the DataFrame.
  • .describe() calculates a few summary statistics for each column.
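A quick sketch on a throwaway DataFrame (the data is invented here):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.head(2))      # first 2 rows
df.info()              # dtype and non-null count per column
print(df.shape)        # (rows, columns) tuple
print(df.describe())   # count, mean, std, min, quartiles, max
```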
21
Q

In pandas, what do the following attributes do?

  • .values
  • .columns
  • .index
A
  • .values: A two-dimensional NumPy array of values.
  • .columns: An index of columns: the column names.
  • .index: An index for the rows: either row numbers or row names.
22
Q

How do you drop duplicates in pandas?

A
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
23
Q

How do you count values in a column in pandas?

A
unique_dogs["breed"].value_counts()
unique_dogs["breed"].value_counts(sort=True)
s.value_counts(normalize=True)  # returns proportion of total
s.value_counts(normalize=True).sort_index()

normalize=True transforms the result into proportions of the total rather than raw counts.

24
Q

How do you sort values in a column in pandas?

A
df.sort_values("breed")
df.sort_values(["breed", "weight_kg"])

result = df.sort_values('salary', ascending=True)

Full signature:
DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
25
Q

How do you filter rows in pandas?

A

test = test[test['state'].isin(canu)]
26
Q

How do you use the agg function in pandas?

A

```
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)
dogs["weight_kg"].agg([pct30, pct40])
dogs[["weight_kg", "height_cm"]].agg(pct30)
```

Group the results by title, then count the number of accounts:

```
counted_df = licenses_owners.groupby('title').agg({'account': 'count'})
```
27
Q

Subset + aggregation in pandas

A

```
sales_C = sales[sales['type'] == "C"]['weekly_sales'].sum()
```

Subset without aggregation (returns only the price column for matching types):

```
avocados[avocados["type"] == "conventional"]["avg_price"]
```

Count the number of missing values in the budget column:

```
number_of_missing_fin = movies_financials['budget'].isnull().sum()
```
28
Q

How do you group by in pandas?

A

avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
29
Q

How do you detect missing values in a df?

A

```
df.isna().any()
```

Plotting missing values:

```
import matplotlib.pyplot as plt
dogs.isna().sum().plot(kind="bar")
plt.show()
```

Dealing with missing values:

```
dogs.dropna()
dogs.fillna(0)
```
30
Q

Create a df by using a list of dictionaries - row by row

A

```
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2019-05-09"}
]

new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
```
31
Q

Create a df by using a dictionary of lists - by column

A

```
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2019-05-09"]
}

new_dogs = pd.DataFrame(dict_of_lists)
```
32
Q

Join/Merging in Pandas

A

```
wards_census = wards.merge(census, on='ward', suffixes=('_ward', '_cen'))
```

Multiple tables:

```
grants_licenses_ward = grants.merge(licenses, on=['address', 'zip']) \
    .merge(wards, on='ward', suffixes=('_bus', '_ward'))
grants_licenses_ward.head()
```

Merge with left join:

```
movies_taglines = movies.merge(taglines, on='id', how='left')
print(movies_taglines.head())
```

Different columns:

```
movies_and_scifi_only = movies.merge(scifi_only, how='inner',
                                     left_on='id', right_on='movie_id')
```

An indicator column to know the source table and commonalities (left_only, both, right_only):

```
genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
```
33
Q

Concat Dataframes

A

```
pd.concat([dfA, dfB], ignore_index=True)
```

Concat tables with different column names:

```
pd.concat([inv_jan, inv_feb], sort=True)
```

OR

```
pd.concat([inv_jan, inv_feb], join='inner')
```
34
Q

Appending Tables

A

inv_jan.append([inv_feb, inv_mar], ignore_index=True, sort=True)
35
Q

Validating merges in pandas

A

Validating merges: .merge(validate=None)

`one_to_one`, `many_to_one`, `many_to_many`

Verifying concatenations: .concat(verify_integrity=False)

* Checks whether the new concatenated **index** contains duplicates
* Default value is False
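A sketch of validate in action with invented tables — the merge raises MergeError when the stated relationship is violated:

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "x": ["a", "b"]})
right = pd.DataFrame({"id": [1, 1, 2], "y": [10, 11, 20]})

# id 1 matches two rows on the right, so one_to_one is violated...
try:
    left.merge(right, on="id", validate="one_to_one")
except pd.errors.MergeError as e:
    print("validation failed:", e)

# ...but one_to_many describes this relationship correctly.
ok = left.merge(right, on="id", validate="one_to_many")
print(ok)
```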
36
Q

How to check if a column in dfA has the same values as a column in dfB - Filtering

A

popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]
37
Q

merge_ordered()

A

* Great for time series data
* Column(s) to join: on, left_on, and right_on
* Type of join: how (left, right, inner, outer); default is outer
* Overlapping column names: suffixes
* Calling the function: pd.merge_ordered(df1, df2)
* fill_method='ffill' (forward fill) fills missing data from the previous row (above it)
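A sketch with invented monthly data, showing fill_method='ffill' carrying the last known gdp value forward:

```python
import pandas as pd

gdp = pd.DataFrame({"date": ["2020-01", "2020-03"], "gdp": [1.0, 1.2]})
cpi = pd.DataFrame({"date": ["2020-02", "2020-03"], "cpi": [2.0, 2.1]})

# Outer join by default, rows kept in date order;
# ffill fills a missing value from the row above it.
out = pd.merge_ordered(gdp, cpi, on="date", fill_method="ffill")
print(out)
```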
38
Q

Using merge_asof()

A

* Similar to a merge_ordered() left join
* Matches on the nearest **key column and not exact matches.**
* Merged **"on" columns must be sorted.**

```
pd.merge_asof(visa, ibm, on=['date_time'],
              suffixes=('_visa', '_ibm'),
              direction='forward')
```

The default direction is backward, but you can also choose *forward* or *nearest*.
39
Q

Query method in pandas

A

* .query('SOME SELECTION STATEMENT')
* Accepts an input string
* Input string used to determine what rows are returned
* Input string similar to the statement after a WHERE clause in SQL

`stocks.query('nike >= 90')`

`stocks_long.query('stock=="disney" or (stock=="nike" and close < 90)')`

Accessing a date column that's an index:

`recent_gdp_pop = gdp_pivot.query('date <= "1991-01-01"')`
40
Q

Pivot

A

```
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
```

Filling missing values in pivot tables:

```
dogs.pivot_table(values="weight_kg", index="color",
                 columns="breed", fill_value=0)
```
41
Q

Melt

A

* To make analysis of data in a table easier, we can reshape the data into a more computer-friendly form
* pandas.melt() unpivots a DataFrame from wide format to long format.

`social_fin_tall = social_fin.melt(id_vars=['financial', 'company'])`

You can melt only certain values in a column. Here it's only values in the financial column equal to 2018 & 2017:

```
social_fin_tall = social_fin.melt(id_vars=['financial', 'company'],
                                  value_vars=['2018', '2017'],
                                  var_name='year',
                                  value_name='dollars')
```

Think of id_vars as the columns you want to keep the same; the rest will be organised into proper rows.
42
Q

Uniform distribution

A

```
from scipy.stats import uniform
uniform.cdf(7, 0, 12)
```

It's used with continuous distributions where we need to find the area under the curve for that distribution.
43
Q

Seaborn plot

A

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", hue_order=["Yes", "No"])
plt.show()
```

OR

```
import matplotlib.pyplot as plt
import seaborn as sns
hue_colors = {"Yes": "black", "No": "red"}
sns.scatterplot(x="total_bill", y="tip", data=tips,
                hue="smoker", palette=hue_colors, size='size')
plt.show()
```

Relationship plot and a few extra variables that it takes:

```
import seaborn as sns
import matplotlib.pyplot as plt
sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter",
            col="day", col_wrap=2,
            col_order=["Thur", "Fri", "Sat", "Sun"])
plt.show()
```

Different point style and transparency:

```
# Set alpha to be between 0 and 1
sns.relplot(x="total_bill", y="tip", data=tips, kind="scatter", alpha=0.4)
```
44
Q

Seaborn line plots

A

Subgroups by location:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2_mean", data=air_df_loc_mean,
            kind="line", style="location", hue="location")
plt.show()
```

Multiple observations per x-value; this automatically plots a confidence interval:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line")
plt.show()
```

Plotting standard deviation:

```
import matplotlib.pyplot as plt
import seaborn as sns
sns.relplot(x="hour", y="NO_2", data=air_df, kind="line", ci="sd")
plt.show()
```
45
Q

Seaborn categorical plots

A

Count plot:

```
import matplotlib.pyplot as plt
import seaborn as sns
category_order = ["No answer", "Not at all", "Not very",
                  "Somewhat", "Very"]
sns.catplot(x="how_masculine", data=masculinity_data,
            kind="count", order=category_order)
plt.show()
```

Bar plots show the mean of a quantitative variable per category.

Box plots: sym changes the appearance of outliers:

```
g = sns.catplot(x="time", y="total_bill", data=tips, kind="box", sym="")
```

To change whiskers (default is 1.5 * IQR): whis=2, or whis=[5, 95]
46
Q

Seaborn point plots

A

* Points show the mean of a quantitative variable
* A line plot has a quantitative variable (usually time) on the x-axis
* A point plot has a categorical variable on the x-axis

```
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.catplot(x="age", y="masculinity_important",
            data=masculinity_data,
            hue="feel_masculine", kind="point",
            join=False,
            estimator=np.median,  # more robust to outliers
            capsize=0.2)
plt.show()
```
47
Q

Seaborn Styles and colors

A

`sns.set_palette("RdBu")`
`sns.set_style('whitegrid')`

Changing the scale:

* Figure "context" changes the scale of the plot elements and labels
* sns.set_context()
* Smallest to largest: "paper", "notebook", "talk", "poster"

FacetGrid vs. AxesSubplot objects: Seaborn plots create two different types of objects, FacetGrid and AxesSubplot.

```
g = sns.scatterplot(x="height", y="weight", data=df)
type(g)
```

For a FacetGrid: g.fig.suptitle('title')
For an AxesSubplot: g.set_title('title')

Titles for subplots:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data,
                kind="box", col="Group")
g.fig.suptitle("New Title", y=1.03)
g.set_titles("This is {col_name}")
```

Adding axis labels - works for both:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
g.set(xlabel="New X Label", ylabel="New Y Label")
plt.show()
```

Rotating x-axis tick labels:

```
g = sns.catplot(x="Region", y="Birthrate", data=gdp_data, kind="box")
plt.xticks(rotation=90)
plt.show()
```
48
Q

Pandas Series string methods

A

```
jobs['roles'] = jobs['roles'].str.lower()
```

```
print(contact.email.str.split('@', expand=True))
```

```
print(s.str.startswith('re'))
```
49
Q

Pandas Duplicated

A

Returns a boolean Series denoting duplicate rows.

`df.duplicated()`
50
Q

Jittered scatter plots

A

age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss))
weight = brfss['WTKG3']
plt.plot(age, weight, 'o', markersize=5, alpha=0.2)
plt.show()

If it's a small sample of data, use a larger marker size.
51
Q

Violin Plot

A

data = brfss.dropna(subset=['AGE', 'WTKG3'])
sns.violinplot(x='AGE', y='WTKG3', data=data, inner=None)
plt.show()
52
Q

Bootstrapping

A

Bootstrapping is resampling with replacement; all bootstrapped samples are the same size, and a statistic is applied to each one.

Bootstrapping the coffee mean flavor:

```
import numpy as np

mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
    )
```

Bootstrapping does not account for biases in the data.

Bootstrapping is great for estimating standard deviations rather than the mean.
53
Q

Convert series string to lowercase

A

print(data.str.lower())
54
Q

Random sample

A

print(chess.sample(n=5, random_state=42))
55
Q

Correlation and plots

A

Pairplots:

sns.pairplot(data=divorce)
plt.show()

sns.pairplot(data=divorce,
             vars=["income_man", "income_woman", "marriage_duration"])
plt.show()

Correlation heatmap:

sns.heatmap(planes.corr(), annot=True)
plt.show()
56
Q

Kernel Density Estimate (KDE)

A

sns.kdeplot(data=divorce, x="marriage_duration",
            hue="education_man", cut=0)
plt.show()

sns.kdeplot(data=divorce, x="marriage_duration",
            hue="education_man", cut=0, cumulative=True)
plt.show()
57
Q

Pandas class imbalance

A

value_counts, OR aggregated values with pd.crosstab():

pd.crosstab(planes["Source"], planes["Destination"],
            values=planes["Price"], aggfunc="median")

OR

pd.crosstab(planes["Source"], planes["Destination"])
58
Q

pd.cut

A

Labels and bins:

labels = ["Economy", "Premium Economy", "Business Class", "First Class"]
bins = [0, twenty_fifth, median, seventy_fifth, maximum]
planes["Price_Category"] = pd.cut(planes["Price"],
                                  labels=labels, bins=bins)