Second Flashcards

1
Q

Build df from dictionary

A

pd.DataFrame()

dict = { “country” : [“Nepal”, “India”, “China”], “Capital” : [‘Kathmandu”, “Delhi”, “Bejing”]}

SA = pd.DataFrame(dict)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

CSV to DataFrame

A

df = pd.read_csv(‘path/to/dataframe.csv)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

CSV to DataFrame

1. Index the df

A

index_col = 0

df = pd.read_csv(‘path/to/dataframe.csv, index_col = 0)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Sort from highest to lowest

A

df.sort_values(‘col’, ascending = False)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Sort by multiple variables

A

df.sort_values([‘col1’, ‘col2’])

  1. Sort by the col1
  2. Then sort by col2

dogs.sort_values([‘weight_kg’, ‘height_cm’], ascending = [True, False])

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Subset rows

A

df[col] > #

0 True
1 False
2 True

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Subset rows and get data

A

df[df[‘col’] > #]

0 Bella Labrador Brown
4 Max Labrador Black

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Subset based on a string

A

df[df[‘col1’] == ‘string’]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Subset based on date

A

df[df[‘col_date’] > “2015-01-01”]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Subset on multiple condiditons

A

is_x = df[‘col1’] == ‘x’
is_y = df[‘col2’] == ‘y’
df[is_x & is_y]

 col1   col2   number 0    x        y          56
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Subset using isin()

A

is_x_or_y = df[‘col1].isin([‘x’, ‘y’])
df[is_x_or_y]

is_black_or_brown = dogs[‘color’].isin([‘black’, ‘brown’])
df[is_black_or_brown]

  name    color       height   0     Bel       brown       88 4     Max      black        55
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Adding new column + mutating df

A

df[‘new_col’] = df[‘col’] / 100

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Individuals per 10K

A

df[‘per_10k’] = 10000 * df[‘number’] / df[‘total’]

homelessness[‘indv_per_10k’] = 10000 * homelessness[‘individuals’] / homelessness[‘state_pop’]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Summary Stats

A

df[‘col’].mean()

df[‘col’].median()

df[‘col’].mode()

df[‘col’].min()

df[‘col’].max()

df[‘col’].var()

df[‘col’].std()

df[‘col’].sum()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Quantile

A

df[‘col’].quantile()

Where a sample is divided into equal-sized, adjacent subgroups

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Aggregate summary stats from one column

A

df[‘col’].agg(‘function’)

df[‘ext price’].agg(‘sum’)

             ext price sum          2.1
17
Q

Aggregate multiple summary stats from multiple columns

A

df[[‘col1’, ‘col2’]].agg([‘func1’, ‘func2’])

df[[‘price’, ‘quantity’]].agg([‘sum’, ‘mean’])

           price       quantity sum          2.1            25 mean        1                12
18
Q

Cumulative Sum

A

df[‘col1’].cumsum()

19
Q

Drop duplicates

A

df.drop_duplicates(subset = ‘col’)

vet_visits.drop_duplicates(subset = ‘breed’)

20
Q

Drop duplicates multiple columns

A

df.drop_duplicates(subset = [‘col1’, ‘col2’])

vet_visits.drop_duplicates(subset = [‘name’, ‘breed’])

21
Q

Count number of occurrences

A

df[‘col’].value_counts()

unique_dogs[‘breed’].value_counts()

labrador 4
Chow Chow 3
Poodle 2

22
Q

Count number of occurrences & sort values

A

df[‘col’].value_counts(sort = True)

23
Q

Count proportions of Total

A

df[‘col’].value_counts(normalize = True)

Lab 0.5
Chow 0.25
Poodle 0.25

24
Q

Grouped Summary Statistics by attribute Pt2

A

df[df[‘col’] == ‘attribute’][‘measure col’].stat()

dogs[dogs[‘color’] == ‘black’][‘weight’].mean()

dogs[dogs[‘color’] == ‘black’][‘weight’].sum()
dogs[dogs[‘color’] == ‘brown’][‘weight’].sum()

26
24

25
Q

Grouped Summary Statistics (groupby) Pt2

A

df. groupby(‘col’)[‘measure col’].stat()
dogs. groupby(‘color’)[‘weight’].mean()

color
black 26
brown 24

26
Q

Multiple stats groupby summary

A

df. groupby(‘col’)[‘measure col’].agg([min, max, sum])
dogs. groupby(‘color’)[‘weigth’].agg([min, max, sum])

        min       max       sum color black    24         29           53 brown   24        24           48
27
Q

Group by multiple variables

A

df. groupby([‘col1’, ‘col2’])[‘measure col’].stat()
dogs. groupby([‘color’, ‘breed’])[‘weight’].mean()

color     breed
black     chow           25
              lab               29
              poodle        24
Brown   chow           24
              lab               25
28
Q

Group by multiple cols and multiple measures

A

df.groupby([‘col1’, ‘col2’])[[‘measure col1’], [‘measure col2’]].stat()

29
Q

Pivot Table

A

df.pivot_table(values = ‘measure col’, index = ‘group_col’)

By default it takes the mean value of each group

dogs.pivot_table(values = ‘weight’, index = ‘color’)

                 weight color black             26 brown           24
30
Q

Pivot Table: Change the statistic

A

aggfunc =

df.pivot_table(values = ‘measure col’, index = ‘group_col’, aggfunc = np.median)

31
Q

Pivot Table: Multiple statistics

A

aggfunc = []

df.pivot_table(values = ‘measure col’, index = ‘group_col’, aggfunc = [np.median, np.mean])

32
Q

Pivot Table: 2 Variables

A

columns =

df. pivot_table(values = ‘measure col’, index = ‘group_col’, columns = ‘group_col2’)
dogs. pivot_table(values = ‘weight’, index = ‘color’, columns = ‘breed’)

breed chow lab poodle
color
black NA 29 24
brown 24 24 NA

33
Q

Pivot table: Fill Missing Values

A

fill_value =

df. pivot_table(values = ‘measure col’, index = ‘group_col’, fill_value = 0)
dogs. pivot_table(values = ‘weight’, index = ‘color’, columns = ‘breed’, fill_value = 0)

breed chow lab poodle
color
black 0 29 24
brown 24 24 0

34
Q

Pivot Table: Sum All

A

df. pivot_table(values = ‘measure col’, index = ‘group_col’, fill_value = 0, margins = True)
dogs. pivot_table(values = ‘weight’, index = ‘color’, columns = ‘breed’, fill_value = 0, margins = True)

breed  chow  lab  poodle  All 
color
black    0        29     24       53
brown  24        24    0         48
All         24        53    24