Lesson 7: NumPy and Pandas Analysis Flashcards
Create an array of 10 zeros and ensure they are integers.
np.zeros(10, dtype='int')
Create a matrix with a predefined value of 5.45 with 3 rows and 5 cols.
np.full((3,5),5.45)
Create an array of 5 evenly spaced values between 0 and 2 (both endpoints included).
np.linspace(0, 2, 5)
Create a 3x3 array of random numbers drawn from a normal distribution with mean 0 and standard deviation 1.
np.random.normal(0, 1, (3,3))
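The four array-creation cards above can be checked together in one short script. This is a minimal sketch; the seeded generator `rng` is an assumption added for reproducibility and is not part of the cards, which call `np.random.normal` directly.

```python
import numpy as np

zeros = np.zeros(10, dtype=int)     # ten zeros stored as integers
filled = np.full((3, 5), 5.45)      # 3 rows x 5 columns, every entry 5.45
spaced = np.linspace(0, 2, 5)       # 5 evenly spaced points, endpoints included

# Assumption: a seeded Generator so repeated runs give the same draws.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0, scale=1, size=(3, 3))  # mean 0, standard deviation 1

print(zeros.dtype, filled.shape, spaced, normal.shape)
```

Note that normally distributed values are not confined to any interval; only the mean and standard deviation are fixed.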
Combine the following arrays into one: x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21,21,21]
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = [21,21,21]
np.concatenate([x, y,z])
Concatenate the grid array with itself, where grid = np.array([[1,2,3],[4,5,6]]).
grid = np.array([[1,2,3],[4,5,6]])
np.concatenate([grid,grid])
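For a 2-D array the result of np.concatenate depends on the axis. The sketch below shows the default (axis=0, as in the card) next to axis=1; the variable names `stacked_rows` and `stacked_cols` are illustrative only.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

# Default axis=0: the second copy is appended as extra rows.
stacked_rows = np.concatenate([grid, grid])

# axis=1: the second copy is appended as extra columns.
stacked_cols = np.concatenate([grid, grid], axis=1)

print(stacked_rows.shape)  # (4, 3)
print(stacked_cols.shape)  # (2, 6)
```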
Create a DataFrame from a dictionary with the columns Fruit and Items. The values for Items are [121, 40, 100, 130, 11] and the values for Fruit are ['Peach', 'Apple', 'Pear', 'Plum', 'Kiwi'].
data = pd.DataFrame({'Fruit': ['Peach','Apple','Pear','Plum','Kiwi'],
'Items': [121,40,100,130,11]})
How do you get a summary of the dataset (column names, dtypes, non-null counts, memory usage)?
data.info()
Make a DataFrame with the columns group and kg. Group values: 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'; kg values: 4, 3, 12, 6, 7.5, 8, 3, 5, 6
data = pd.DataFrame({'group':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],'kg':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
Sort the data df by kg in ascending order, modifying the original df.
data = pd.DataFrame({'group': ['a','a','a','b','b','b','c','c','c'], 'kg': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data.sort_values(by=['kg'],ascending=True,inplace=True)
Sort data by multiple columns: group in ascending order and kg in descending order. Make sure you don't modify the original dataset.
data.sort_values(by=['group','kg'],ascending=[True,False],inplace=False)
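A worked version of the multi-column sort, using the group/kg values from the cards. It is a sketch showing that the returned frame is sorted while the original keeps its row order; `sorted_copy` is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'kg': [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
})

# group ascending, then kg descending within each group.
# inplace defaults to False, so a new DataFrame is returned.
sorted_copy = data.sort_values(by=['group', 'kg'], ascending=[True, False])

print(sorted_copy['kg'].tolist())  # [12.0, 4.0, 3.0, 8.0, 7.5, 6.0, 6.0, 5.0, 3.0]
print(data['kg'].tolist())         # original order is unchanged
```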
data = pd.DataFrame({'names':['Mila']*3 + ['Igor']*4, 'Age':[3,2,1,3,3,4,4]})
Remove duplicate rows from data.
data.drop_duplicates()
Remove rows with duplicate values in the names column.
data = pd.DataFrame({'names':['Mila']*3 + ['Igor']*4, 'Age':[3,2,1,3,3,4,4]})
data.drop_duplicates(subset='names')
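The two drop_duplicates cards behave differently on the same data, which a quick run makes visible. In this sketch the `keep='last'` call is an extra illustration not asked for by the cards.

```python
import pandas as pd

data = pd.DataFrame({'names': ['Mila'] * 3 + ['Igor'] * 4,
                     'Age': [3, 2, 1, 3, 3, 4, 4]})

# Whole-row duplicates: only the repeated (Igor, 3) and (Igor, 4) rows go.
unique_rows = data.drop_duplicates()

# Duplicates judged on 'names' only: the first row per name is kept.
unique_names = data.drop_duplicates(subset='names')

# keep='last' keeps the final row per name instead of the first.
last_per_name = data.drop_duplicates(subset='names', keep='last')

print(len(unique_rows), len(unique_names))  # 5 2
```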
For the farm shop df (data), create a new column animal2 that maps each value in the food column through meat_to_animal. Lowercase the food values first so they all match.
data['animal2'] = data['food'].map(str.lower).map(meat_to_animal)
Remove the animal2 column (that Series only) from the dataset, modifying the original df.
data.drop('animal2',axis='columns',inplace=True)
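The cards do not define the farm shop data or meat_to_animal, so the sketch below makes both up: the food values, the kg values, and the dictionary contents are hypothetical, chosen only to show the lowercase-then-map pattern and the column drop.

```python
import pandas as pd

# Hypothetical farm shop data; the real dataset is not shown in the cards.
data = pd.DataFrame({'food': ['Bacon', 'pulled pork', 'Pastrami', 'corned beef'],
                     'kg': [4, 3, 12, 6]})

# Hypothetical mapping with lowercase keys.
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig',
                  'pastrami': 'cow', 'corned beef': 'cow'}

# Lowercase first so 'Bacon' and 'Pastrami' match the dictionary keys.
data['animal2'] = data['food'].map(str.lower).map(meat_to_animal)
print(data['animal2'].tolist())  # ['pig', 'pig', 'cow', 'cow']

# Dropping without inplace returns a copy and leaves data untouched.
without_animal2 = data.drop('animal2', axis='columns')
```

Without the lowercase step, any value whose case differs from the dictionary key would map to NaN.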
Add a new column using assign (this returns a new DataFrame and leaves the original unchanged).
data.assign(new_variable = data['kg']*10)
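Because assign returns a new DataFrame, its result has to be captured to be used. A small sketch with a shortened, illustrative version of the group/kg data:

```python
import pandas as pd

# Shortened illustrative data, not the full table from the cards.
data = pd.DataFrame({'group': ['a', 'b', 'c'], 'kg': [4, 7.5, 6]})

# assign does not modify data; bind the result to a name to keep it.
extended = data.assign(new_variable=data['kg'] * 10)

print(extended['new_variable'].tolist())  # [40.0, 75.0, 60.0]
print('new_variable' in data.columns)     # False
```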
Make a DataFrame that holds the values 0-11 in a matrix of 3 rows and 4 columns. Use the index names
['London', 'Manchester', 'Brighton'] and the column names
['one', 'two', 'three', 'four'].
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['London', 'Manchester', 'Brighton'],
columns=['one', 'two', 'three', 'four'])
In the DataFrame data, rename the index label Manchester to Cardiff and the columns one to one_p and two to two_p. Make sure to change the original df.
data.rename(index = {'Manchester':'Cardiff'}, columns={'one':'one_p','two':'two_p'},inplace=True)
Convert the index labels to uppercase and the column labels to title case.
data.rename(index = str.upper, columns=str.title,inplace=True)
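The two rename cards can be chained to see the labels change step by step. This sketch returns new frames instead of using inplace=True so each stage can be inspected; `renamed` and `styled` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['London', 'Manchester', 'Brighton'],
                    columns=['one', 'two', 'three', 'four'])

# Dictionary form: only the listed labels change.
renamed = data.rename(index={'Manchester': 'Cardiff'},
                      columns={'one': 'one_p', 'two': 'two_p'})

# Function form: the function is applied to every label.
styled = renamed.rename(index=str.upper, columns=str.title)

print(styled.index.tolist())    # ['LONDON', 'CARDIFF', 'BRIGHTON']
print(styled.columns.tolist())  # ['One_P', 'Two_P', 'Three', 'Four']
```

str.title capitalizes the letter after the underscore as well, which is why one_p becomes One_P.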
Create categories for this variable: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]. Use the bins: bins = [18, 25, 35, 60, 100]
categories = pd.cut(ages, bins)
Make each bin include its left edge and exclude its right edge.
pd.cut(ages,bins,right=False)
Count how many observations fall into each bin (the frequency per bin). Do this for the categories variable.
pd.Series(categories).value_counts()
Give each category a unique name, then check how many observations fall into each bin. Use bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior']
bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior']
new_cats = pd.cut(ages, bins,labels=bin_names)
pd.Series(new_cats).value_counts()
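The binning cards can be verified end to end with the ages and bins they give. The sketch wraps the result of pd.cut in a Series before counting, on the assumption that the top-level pd.value_counts function is deprecated in recent pandas releases.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
bin_names = ['Youth', 'Early 20s', 'Middle Age', 'Senior']

# Default right=True: intervals are (18, 25], (25, 35], (35, 60], (60, 100].
new_cats = pd.cut(ages, bins, labels=bin_names)
counts = pd.Series(new_cats).value_counts()

# right=False: intervals are [18, 25), [25, 35), ... so age 25 moves up a bin.
left_closed = pd.cut(ages, bins, right=False)
left_counts = pd.Series(left_closed).value_counts()

print(counts.to_dict())
```

With the default setting, age 25 lands in the first bin; with right=False it lands in the second, which changes the first two counts from 5 and 3 to 4 and 4.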
Create a date range called dates starting from 20210701 with a length of 7 periods. Then create a pandas DataFrame with 7 rows and 4 columns of random values drawn from a standard normal distribution, with dates as the row index and the columns labeled 'A', 'B', 'C', and 'D'.
dates = pd.date_range('20210701',periods=7)
df = pd.DataFrame(np.random.randn(7,4),index=dates,columns=list('ABCD'))
df
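A date index allows rows to be selected by date string. The sketch below rebuilds the frame with a seeded generator (an assumption for reproducibility; the card uses np.random.randn) and slices the first three days; `first_days` is an illustrative name.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20210701', periods=7)  # daily frequency by default

# Assumption: seeded Generator in place of np.random.randn.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.standard_normal((7, 4)), index=dates, columns=list('ABCD'))

# Label-based slicing on a DatetimeIndex includes both endpoints.
first_days = df.loc['2021-07-01':'2021-07-03']

print(df.shape)         # (7, 4)
print(len(first_days))  # 3
```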