# 4 Data Manipulation with pandas Flashcards

1
Q

1 Print the first 4 rows of the dataframe

import pandas as pd
data = [[‘tom’, 10], [‘nick’, 15],[‘john’, 29]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

A

`Name  Age 0   tom   10 1  nick   15`
2
Q

2 What does .info() ?

A

Shows information on each of the columns, such as the data type and number of missing values.

3
Q

3 Get the number of rows and columns of the df

```      Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e```
```#Output:
(5, 2)```
A

df.shape

4
Q

4 Get the mean, count, quartiles and other statistics with one line of code

```   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8```
A

df.describe()

```#Output
Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0```
5
Q

5 What does the attribute .values?

A

Gets a two-dimensional NumPy array of values.

6
Q

6 Get the columns names of df

`name  age 0   tom   10 1   nick   15`
A

df.columns

7
Q

7 What does the attribute .index ?

A

An index for the rows: either row numbers or row names.

8
Q

8 Sort df according to name

`Name  Age 0   tom   10 1  nick   15 2  juli   14`
A

print(df.sort_values(by =’name’))

name age
1 nick 15
0 tom 10

9
Q

9 Sort by name and Age (descending)

Name Age
0 tom 10
1 tom 15
2 juli 14

A

print(df.sort_values([“Name”, “Age”],ascending=[True,False]))

10
Q

10 Subset column name as dataframe

Name Age
0 tom 10
1 ana 15
2 juli 14

A

df[[‘Name’]]

name
0 tom
1 nick

11
Q

11 Filter age equal or greater than 14

name age
0 tom 10
1 nick 15
2 ana 17

A

df[df[‘Age’]>=14]

name age
1 nick 15
2 ana 17

12
Q

12 Filter age equal or greater than 14 and name not tom

`name  age 0   tom   10 1   nick   15 2   ana   17`
A

df[(df[‘age’]>=14) & (df[‘name’]!=’tom’)]

name age
1 nick 15

13
Q
```#13 Subsetting rows by categorical variables
#Get tom and ana with a conditional subsetting```
`name  age 0   tom   10 1   nick   15 2   ana   17`
A

names=[‘tom’,’ana’]

condition = df[‘Name’].isin(names)

chosen =df[condition]
print(chosen)

name age
0 tom 10

14
Q

14 Add a column. Populate it doubling values of column age

name age double
0 tom 10 20
1 nick 15 30

A

df[‘double’] = df.age*2

15
Q

15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120

name age Age_2
1 nick 15 150

A

df[‘Age_2’] = df.age*10
old_filter = df[‘Age_2’]>120
df_old =df[old_filter]
print(df_old)

16
Q

16 What is summary statistics?

A

Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc

17
Q

17 Print the mean of column age

`name  age 1  nick   15 2   ana   17`
A

print(df.age.mean())

12.5

18
Q

18 Get maximum value and the minimum of column age

Name Age
0 peter 10
1 ana 15
2 tom 14

A

print(‘max:’, df.age.max())

print(‘min:’, df.age.min())

19
Q

19 What is IQR in Statistics?

A

IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), ​first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.

20
Q

20 Create a custom a IQR function for a dataframe

A
```def iqr(column):
return column.quantile(0.75) - column.quantile(0.25)```
21
Q

21 Get the iqr of the column Double:

```Output:
A  B
0  0  0
1  1  2
2  2  4```

IQR_B: 2.0

A
```def iqr(column):
return column.quantile(0.75) - column.quantile(0.25)```

print(df)
print(‘’)
print(‘IQR_B:’, iqr(df.B))

22
Q
```#22 Create a DataFrame using a loop and getting random integers (col A).
#B is double of A.
#C is cumulative sum of A
#D cumulative max of A (seed 123)```
```#Output:
A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6```
A

import numpy as np
import pandas as pd

np.random.seed(123)
a = []
for i in range(3):
a.append(np.random.randint(0,10))

```df=pd.DataFrame(a, columns =['A'])
df['B'] =df.A*2
df['C']=df.A.cumsum()
df['D']=df.A.cummax()
print(df)```
23
Q

23 Drop duplicates

` name  age Uno    tom   10 Dos   nick   15 Tres  nick   15`
A

df.drop_duplicates(subset=’name’,inplace = True)
print(df)

` name  age Uno   tom   10 Dos  nick   15`
24
Q

24 Count different values in column name

` 0   1 0   tom  10 1  nick  15 2  nick  15`

nick 2
tom 1
Name: 0, dtype: int64

A

df[0].value_counts()

25
Q

25 Count the percentage of frequencies in column Age in ascending order

`name  age 0   tom   10 1  nick   15 2  nick   15`

10 0.3
15 0.7
Name: age, dtype: float64

A

round(df.age.value_counts(normalize=True, sort=False),1)

26
Q

26 Subset rows containing tom and sum their ages

`name  age 0   tom   10 1  nick   15 2  nick   19`
A

df[df[‘name’]==’nick’][‘age’].sum()

27
Q

26 Subset rows containing tom and sum their ages

`name  age 0   tom   10 1  nick   15 2  nick   19`
A

df[df[‘name’]==’nick’][‘age’].sum()

28
Q

27 Group by names and sum their money

`name  money 0   tom     10 1  nick     15 2  nick     19`
```#Output
name
nick    34
tom     10
Name: money, dtype: int64```
A

df.groupby(by=’name’)[‘money’].sum()

29
Q

28 Group by name and calculate mean and maximum value for age column

`name  money 0   tom     10 1  nick     15 2  nick     19`
`   mean  amax name             nick    17    19 tom     10    10`
A

df.groupby(by=’name’)[‘money’].agg([np.mean,np.max])

30
Q

29 Use pivot table to calculate the mean of ages for each name

name age death
0 tom 18 90
1 nick 15 75
2 nick 23 115

A

df.pivot_table(index=’Name’, values=’age’,aggfunc=’mean’)

age
name
nick 19

31
Q

30 Use pivot table to calculate the mean of ages and death, and get totals

name age death
0 nick 15 75
1 nick 23 115

A

print(round(df.pivot_table(index=’name’, values=[‘age’,’death’],
aggfunc=’mean’,margins =True),1))

```#Output:
age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3```
32
Q

31 Transform pivot table ‘xx’ into dataframe

```      age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3```
A

Output

xx.reset_index(inplace=True)

name age death
0 nick 19.0 95.0
1 tom 18.0 90.0
2 All 18.7 93.3

33
Q

32 Get tom with a boolean

name money
0 tom 10
1 nick 15
2 nick 19

```#Output
0     True
1    False
2    False
Name: name, dtype: bool```
A

selection= [‘tom’]

df.name.isin(selection)

34
Q

33 Get tom using a list. Subset data frame

name money
0 tom 10
1 nick 15
2 nick 19

Name Age Death
1 tom 12 108
2 tom 10 90

A

selection = [‘tom’]

print(df[df.name.isin(selection)])

35
Q

34 Multi index (year,month)and subset with loc

```    month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45```
`      sale year month       2012 1        55 2014 10       31`
A

df.set_index([‘year’,’month’],inplace=True)
rows_to_keep = [(2012,1),(2014,10)]
print(df.loc[rows_to_keep])

36
Q

```   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45```
```#Output:
sale
year month
2016 4        84
2015 4        40
2014 10       31
2013 9        45
2012 1        55```
A

df.set_index([‘year’,’month’],inplace=True)

print(df.sort_index(level=[‘year’,’month’], ascending=[False,True]))

37
Q

36 Set index (year), sort and slice with loc (years,2012 to 2014)

```month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45```
```       month  sale
year
2012      1    55
2013      9    45
2014     10    31```
A

df.set_index(‘year’,inplace=True)
df.sort_index(inplace=True)
print(df.loc[2012:2014])

38
Q

37 Create a column with the length of ‘name’

`  Name  len 0    peter    5 1     ford    4 2  santana    7`
A

df[‘len’]=df[‘Name’].str.len()

39
Q

Output

```#39 Create a dataframe with shape 3,4
#with random number with mean 0 and variance 1
#time index from January 1 2013 to January 3```

np.random.seed(123)

`         0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2`
A

df = pd.DataFrame(np.random.randn(3, 2),
index=pd.date_range(‘20130101’, periods=3))

print(round(df,1))

40
Q

40 What are offset aliases?

A

A number of string aliases given to useful common time series frequencies.

e.g ‘Y’==year

41
Q

41 Slice in both directions.

`                  0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2`
`       0 2013-01-01  0.550524 2013-01-02 -1.108726`
A

print(df.iloc[:2,:1])

42
Q

42 Slice Time series starting with 2014

`                  0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186`
A

Output:

print(df.loc[‘2014’:])

`            0 2014-12-31  0.055680 2015-12-31  0.136186`
43
Q

43 Populate column year

`        0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186`
`            0  year 2013-12-31 -0.167130  2013 2014-12-31  0.025054  2014 2015-12-31 -0.885574  2015`
A

df[‘year’] = df.index.year

44
Q

44 Reset index (date)and change its name for ‘date’ using a dictionary

`   0 2013-12-31  1.069279 2014-12-31  0.598574 2015-12-31 -0.247078`
A

Output

df=df.reset_index().rename(columns={‘index’: ‘bar’})

date 0
0 2013-12-31 0.285032
1 2014-12-31 2.681187
2 2015-12-31 -0.249268

45
Q

45 Delete column col ‘len’ from df

` Name  len 0    peter    5 1     ford    4 2  santana    7`
A

del(df[‘len’])

```#Output:
Name
0    peter
1     ford
2  santana```
46
Q

46 Transform dataframe into pivot table

A B C
0 Fer Masters NaN

```B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN```
A

df.pivot(‘A’, ‘B’, ‘C’)

Or

df.pivot(index =’A’, columns =’B’, values =’C’)

47
Q

47 Subsetting pivot tables (pt)

```B     Graduate  Masters
A
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN```

A
Fer NaN NaN
Ger 22.0 NaN

A

pt_sub = pt.loc[‘Fer’:’Ger’]

print(pt_sub)

48
Q

48 Calculate the mean of the columns in this pivot table (pt)

```B     Graduate  Masters
A
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN```
```#Output:
B
Masters      NaN
dtype: float64```
A

pt.mean()

49
Q

49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib)

`  A         B   C 0  John   Masters  27 1  Boby  Graduate  23 2  Mina  Graduate  21`

B
Masters 15.0
Name: C, dtype: float64

A

import matplotlib.pyplot as plt
gb = df.groupby(‘B’)[‘C’].mean()
gb.plot(kind =’bar’)
plt.show()

50
Q

50 Group by dates and visualize column 0 with a line chart

dates 0
0 2013-12-31 0.537526
1 2014-12-31 -0.290812
2 2015-12-31 0.679620

```dates
2013-12-31    0.537526
2014-12-31   -0.290812
2015-12-31    0.679620
Name: 0, dtype: float64```
A
```import matplotlib.pyplot as plt
df_grouped=df.groupby(by='dates')[0].sum()
df_grouped.plot(kind='line')
print(df_grouped)
plt.show()```
51
Q

51 Scatter plot with 0 in the x axis and Scatter as title

import pandas as pd
df = pd.DataFrame(np.random.randn(3))
print(df)

A

Output: (plot in ipynb doc)

import matplotlib.pyplot as plt

df. reset_index().plot(kind=’scatter’, x=’index’, y=0)
plt. show()

52
Q

52 display two histograms showing Titles (C ).Transparency 0.5 and two bins

df = pd.DataFrame({‘Uni’: [‘Harvard’, ‘UOC’, ‘Stanford’,’Tsinghua’],
‘Title’: [‘Master’, ‘MBI’, ‘Master’,’MBI’],
‘Students’: [15, 22, 19,13]})

A

df[df[‘Title’]==’MBI’][‘Students’].hist(alpha=0.5,bins=2)
df[df[‘Title’]==’Master’][‘Students’].hist(alpha=0.5,bins=2)
plt.legend([‘MBI’,’Master’])
plt.show()

53
Q

53 Replace number 21 in df with NaN values

```    Uni   Title  Students
0   Harvard  Master        21
1       UOC     MBI        21
2  Stanford  Master        19
3  Tsinghua     MBI        13```
A

df.replace({21:np.nan})

```#Output:
Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0```
54
Q

54 Check if there are NaN values for each column

```   Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0```
A

Uni False

df.isna().any()

Title False
Students True
dtype: bool

55
Q

55 Count Nan values in each column and in the total

```        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0```
A

nan_cols=df.isna().sum()
nan_total=df.isna().sum().sum()
print(‘each col:\n’, nan_cols)
print(‘total:’,nan_total)

```#Output:
each col:
Uni         0
Title       0
Students    2
dtype: int64
total: 2```
56
Q

56 Delete Nan values

`     Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0`
A

Output:

df.dropna()

`  Uni   Title  Students 2  Stanford  Master      19.0`
57
Q

57 Transform NaN values into 0s

`   Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0`
A

Output:

df.fillna(0)

Uni Title Students
0 Harvard Master 0.0
1 UOC MBI 0.0
2 Stanford Master 19.0