Pandas Pro Flashcards

1
Q

What is chaining?

A

Perform multiple operations on a DataFrame in a single expression by calling each method on the result of the previous one. Wrapping the chain in parentheses lets you split it across lines for readability.

import numpy as np

df_updated = (
    df
    .query("release_year > 2018")  # Keep titles released after 2018
    .loc[:, ["title", "release_year", "duration"]]  # Keep only these columns
    .assign(over_three_hours=lambda dataframe: np.where(dataframe["duration"] > 180, "Yes", "No"))  # Flag titles longer than 180 minutes
    .groupby(by=["release_year", "over_three_hours"])  # Group by release year and the new flag
    .count()  # Count titles per group
)
df_updated

2
Q

nlargest and nsmallest

A

Instead of using sort_values to find the largest or smallest values in your data, consider using nlargest and nsmallest. These functions are faster and more memory-efficient, making them a great choice for large datasets.

df.nsmallest(3, "age")  # Youngest 3 passengers
df.nlargest(3, "age")   # Oldest 3 passengers

3
Q

Filtering data with .query() method

A

Pandas' query function allows you to filter your data using logical expressions written as strings. You can also use the @ symbol to refer to Python variables inside the query string, making it a convenient and powerful tool for filtering data.

df["embark_town"].unique()  # ['Southampton', 'Cherbourg', 'Queenstown', nan]

embark_towns = ["Southampton", "Queenstown"]  # Only want to select these towns

df.query("age > 21 & fare > 250 & embark_town == @embark_towns").head()

4
Q

pd.cut method

A

The cut function is a useful tool for binning your data into discrete categories. This can be useful for visualizing your data or for transforming continuous variables into categorical ones.

import matplotlib.pyplot as plt

# Child   - 0 to 10 years
# Teen    - 11 to 19 years
# Young   - 20 to 24 years
# Adult   - 25 to 59 years
# Elderly - 60+ years
bins = [0, 10, 19, 24, 59, float('inf')]
labels = ["Child", "Teen", "Young", "Adult", "Elderly"]
df["age"].hist()
plt.show()
df["age_category"] = pd.cut(df["age"], bins=bins, labels=labels)
sorted_df = df.sort_values(by="age_category")
sorted_df["age_category"].value_counts(sort=False).plot(kind="bar")  # hist() needs numeric data, so plot category counts instead
plt.show()

5
Q

Avoid using inplace

A

The inplace parameter in Pandas allows you to perform operations directly on your dataframe, but it can be dangerous to use, as it can make your code harder to read and debug. Instead, use the standard approach of assigning the result of the operation back to a variable.

# Avoid: using inplace to remove the first row of the DataFrame directly
# df.drop(0, inplace=True)

# Prefer: assign the result
df = df.drop(0)

6
Q

Avoid unnecessary apply

A

The apply function can be a powerful tool, but it can also be slow and memory-intensive. Avoid apply when there is a direct, vectorized way to accomplish the same goal.

columns = ['space_ship', 'galaxy', 'speed',
           'maneuverability', 'pilot_skill', 'resource_management']

# Vectorized: calculate the win probability for all rows at once
df['win_prob'] = (df['speed'] * df['maneuverability'] * df['pilot_skill']) / df['resource_management']

# Slower row-wise alternative using .apply()
# df['win_prob'] = df.apply(lambda row: (row['speed'] * row['maneuverability'] * row['pilot_skill']) / row['resource_management'], axis=1)

7
Q

df.sample(n)

A

It returns a random sample of n rows from the DataFrame.
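
For example, assuming df is already loaded:

df.sample(6)  # 6 randomly chosen rows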

8
Q

df.shape

A

It displays the DataFrame's dimensions as a (rows, columns) tuple, e.g. (2823, 25) means 2823 rows and 25 columns.
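
For example, on the sales dataset used later in this deck:

df.shape  # (2823, 25) -> 2823 rows, 25 columns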

9
Q

df.describe()

A

Get the basic summary statistics (count, mean, std, min, quartiles, max) of each numeric column of the data.
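
For example (describe() covers numeric columns by default; include="all" is a standard option that adds the rest):

df.describe()               # Summary stats for numeric columns
df.describe(include="all")  # Include non-numeric columns as well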

10
Q

df.info()

A

Get a summary of the DataFrame: each column's data type, its non-null count, and the overall memory usage.
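
For example:

df.info()  # Prints dtypes, non-null counts, and memory usage for each column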

11
Q

df.memory_usage()

A

It will tell you how much memory is being consumed by each column.
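
For example (deep=True is a standard option that also counts the contents of object columns):

df.memory_usage(deep=True)  # Bytes used per column, including the index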

12
Q

df.iloc[row_num]

A

It will select a particular row by its integer position (not its label).
For example:

df.iloc[0]

13
Q

df[['col1', 'col2']]

A

Passing a list of column names selects multiple columns and returns them as a DataFrame.
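
For example, selecting two columns from the sales dataset used in this deck:

df[["SALES", "PRICEEACH"]]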

14
Q

df.isnull()

A

This returns a boolean DataFrame of the same shape, with True marking each missing value.
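
A handy follow-up is counting the missing values per column:

df.isnull().sum()  # Missing-value count for each column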

15
Q

df.dropna()

A

This will remove every row that contains a missing value in any column (the default behavior).
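
For example (the subset column name here is just illustrative):

df = df.dropna()                # Drop rows with a missing value in any column
df = df.dropna(subset=["age"])  # Drop only rows where "age" is missing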

16
Q

df.fillna(val)

A

This will fill the missing values with the val given in the argument.
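
For example, assuming a numeric column like the Titanic age column used earlier in this deck:

df["age"] = df["age"].fillna(df["age"].mean())  # Replace missing ages with the mean age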

17
Q

df['col'].astype(new_data_type)

A

It converts the selected column to a different data type.
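
For example, converting the SALES column from float to int:

df["SALES"] = df["SALES"].astype(int)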

18
Q

Aggregation Functions:

A

You can group a column by its name and then apply aggregation functions like sum, min/max, mean, etc.

df.groupby("col_name_1").agg({"col_name_2": "sum"})

For example, total sales per city:

df.groupby("CITY").agg({"SALES": "sum"})

If you want to apply multiple aggregations at a single time, you can write them like this:

aggregation = df.agg({"SALES": "sum", "QUANTITYORDERED": "mean"})

Output:

SALES 1.003263e+07

QUANTITYORDERED 3.509281e+01

dtype: float64

19
Q

Filtering Data

A

We can filter the rows based on a specific value or a condition.

For example:

df[df["SALES"] > 5000]

This displays the rows where the value of SALES is greater than 5000.

You can also filter the dataframe using the query() function, which takes the condition as a string and produces the same output as above:

df.query("SALES > 5000")

20
Q

Pivot Tables

A

We can create pivot tables that summarize the data using specific columns. This is very useful for analyzing the data when you only want to consider the effect of particular columns.

For example:

pd.pivot_table(df, values="SALES", index="CITY", columns="YEAR_ID", aggfunc="sum")

To break this down:

values: the column whose values populate the table's cells.
index: each unique category of this column becomes a row in the pivot table.
columns: each unique element of this column becomes a column header in the pivot table.
aggfunc: the same kind of aggregation function we discussed earlier.

21
Q

Combining Data Frames

A

We can combine several data frames either vertically (stacking rows) or horizontally. pd.concat concatenates the given data frames and returns a single merged data frame.

For example:

combined_df = pd.concat([df1, df2])

You can also merge two data frames based on a common column. This is useful when you want to combine two data frames that share a common identifier.

For example:

merged_df = pd.merge(df1, df2, on="common_col")

22
Q

Applying Custom Functions

A

You can apply custom functions to a column or row according to your needs.

For example:

def cus_fun(x):
    return x * 3

df["Sales_Tripled"] = df["SALES"].apply(cus_fun)

This custom function triples the sales value for each row. (Note that Series.apply takes no axis argument; axis only matters for DataFrame.apply, where axis=0 applies the function to each column and axis=1 to each row.)

With this approach you have to write a separate function and then call it from the apply() method. A lambda function lets you define the custom function inside the apply() call itself:

df["Sales_Tripled"] = df["SALES"].apply(lambda x: x * 3)

23
Q

Applymap

A

We can also apply a custom function to every element of the dataframe in a single line of code. A point to remember is that it is applied to all the elements in the dataframe.

For example:

df = df.applymap(lambda x: str(x))

This converts every element in the dataframe to a string. (In recent pandas releases, DataFrame.map is the newer name for applymap.)

24
Q

Time Series Analysis

A

In mathematics, time series analysis means analyzing data collected over a specific time interval, and pandas has functions for this type of analysis.

Conversion to datetime objects:

We can convert a date column into datetime format for easier date manipulation.

For example:

df["ORDERDATE"] = pd.to_datetime(df["ORDERDATE"])
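
Once converted, the .dt accessor gives easy access to date parts. A small sketch (the new column names here are just illustrative):

df["order_year"] = df["ORDERDATE"].dt.year    # Extract the year from each date
df["order_month"] = df["ORDERDATE"].dt.month  # Extract the month from each date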

25
Q

Calculate Rolling Average

A

Using this method, we can compute statistics over a rolling window of the data. The window can be any size; window=5 means each value is calculated from the current row and the previous four (a 5-day window for daily data). This helps smooth out fluctuations in your data and identify patterns over time.

For example:

rolling_avg = df["SALES"].rolling(window=5).mean()

26
Q

Cross Tabulation

A

We can perform a cross-tabulation between two columns of a table: a frequency table that shows how often each combination of categories occurs. It can help you understand the distribution of categories across different groups.

For example, getting a cross-tabulation between COUNTRY and DEALSIZE:

cross_tab = pd.crosstab(df["COUNTRY"], df["DEALSIZE"])

27
Q

Handling Outliers

A

A common approach is the IQR (interquartile range) rule: any value more than 1.5 × IQR below the first quartile or above the third quartile is flagged as an outlier.

Q1 = df["SALES"].quantile(0.25)
Q3 = df["SALES"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df["SALES"] < lower_bound) | (df["SALES"] > upper_bound)]
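
A natural follow-up, under the same assumptions, is to keep only the rows within the bounds:

df_filtered = df[(df["SALES"] >= lower_bound) & (df["SALES"] <= upper_bound)]  # Drop the flagged outliers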