Communicating Results Flashcards

Summarizing descriptive statistics, plotting visualizations, drawing conclusions, and customizing visuals to communicate results (66 cards)

1
Q

What is the .groupby() method?

A

This allows you to group data by one or more columns and aggregate information about the groupings. The numeric_only=True argument excludes columns that aren't numeric.

df.groupby("column_name").mean(numeric_only=True)
or
df.groupby(["workclass", "race"], as_index=False)["capital-gain"].mean()

2
Q

What is summation or .sum()?

A

It aggregates data vertically with .sum(axis=0) or horizontally with .sum(axis=1).

df_census[["capital-gain", "capital-loss"]].sum()

3
Q

Visualize how to get the sum while using .groupby and then sort the values in descending order

A

df.groupby(by="column").sum(numeric_only=True).sort_values(by="column2", ascending=False)

4
Q

What are the measures of center?

A

Mean = .mean()
Median = .median()
Mode = .mode()
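A minimal sketch of all three on a small pandas Series (the values are made up):

import pandas as pd
s = pd.Series([1, 2, 2, 3, 10])
print(s.mean())    # 3.6
print(s.median())  # 2.0
print(s.mode())    # a Series containing 2 (there can be more than one mode)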

5
Q

What is the mean?

A

It is the average: the sum of all values in the set divided by the number of values in the set.

6
Q

What is the median?

A

The center value in a sorted set. Always sort the values first, then take the middle value (or the mean of the two middle values when the set has an even number of values).
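A quick check with pandas (made-up values):

import pandas as pd
print(pd.Series([5, 1, 3]).median())     # 3.0 - the middle of the sorted values
print(pd.Series([5, 1, 3, 7]).median())  # 4.0 - the mean of the two middle values, 3 and 5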

7
Q

What is the mode?

A

It is the value with the highest frequency in a set

8
Q

If you had a column named color that contained the following values
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'blue']}), what would you get if you used df['color'].value_counts()?

A

color : count
blue : 3
red : 2
green : 1

9
Q

What does df['col'].values do?

A

It returns a NumPy array containing all the values in the column, including duplicates, in the order they appear. It contains only the raw data: no column labels or index.
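A small sketch (hypothetical column) showing what .values returns:

import pandas as pd
df = pd.DataFrame({'col': ['red', 'blue', 'red']})
print(df['col'].values)  # array(['red', 'blue', 'red'], dtype=object)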

10
Q

What does df.index do?

A

.index returns the row labels of a DataFrame or Series. You can use .index to view, inspect, or change the row labels.

11
Q

What does .columns do?

A

Provides an Index object containing the names of all the columns.

12
Q

Why does df.index output: RangeIndex(start=0, stop=3, step=1)?

A

Pandas assigns a numerical RangeIndex starting from 0 when you create a DataFrame without specifying row labels.

This describes the range of the row indices in the DataFrame:
start=0: The first row index starts at 0.
stop=3: The range stops before 3, so the indices are [0, 1, 2].
step=1: The indices increment by 1 between rows.
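A minimal example (hypothetical three-row DataFrame) showing the default RangeIndex:

import pandas as pd
df = pd.DataFrame({'a': [10, 20, 30]})
print(df.index)  # RangeIndex(start=0, stop=3, step=1)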

13
Q

Visualize how to:
- Use .index to get the unique values.
- Use .values to get the corresponding counts.
- Combine .index and .values to iterate through the value counts.
- Use .sort_index() or .sort_values() to sort as needed.

A

counts = df['column'].value_counts()

# .index gives you the unique values (categories)
unique_values = counts.index
print(unique_values)

# The counts themselves are accessed using .values
count_values = counts.values
print(count_values)

# Sort the results by unique values
sorted_counts = counts.sort_index()

14
Q

What is zip()?

A

zip() is a built-in Python function that combines two or more iterables (lists, tuples, strings, etc.) into a single iterator of tuples.

ex.
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
zipped = zip(list1, list2)
print(list(zipped))
Output: [(1, 'a'), (2, 'b'), (3, 'c')]

15
Q

What are the two ways to check if a column contains a value?

A

A.
Uses the bitwise | to filter based on two conditions
var = df[(df['col'] == 'data1') | (df['col'] == 'data2')]
B.
Uses .isin() to check if the values exist in the column.
var = df[df['col'].isin(['data1', 'data2'])]

16
Q

What is the union of two events?

A

This calculates the probability that either event A or event B happens (or both).
It's P(A or B).
You add the probabilities P(A) and P(B) because you're counting all outcomes that include A or B.
You then subtract the overlap P(A∩B) so it isn't counted twice.

17
Q

What is the formula for union of two events?

A

P(A or B) = P(A) + P(B) - P(A∩B)
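A quick numeric check of the formula (the probabilities are made up):

p_a, p_b, p_a_and_b = 0.5, 0.4, 0.2
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)  # 0.7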

18
Q

What is the intersection of two events?

A

It's P(A and B): the probability that events A and B happen at the same time.
You only consider the overlap between A and B, where both events occur.

19
Q

What is the formula for the intersection of two events?

A

P(A and B) = P(A∩B). For independent events, P(A∩B) = P(A) × P(B).

20
Q

What is conditional probability?

A

This calculates the probability that A happens given that B has already happened. You're "zooming in" on the subset of outcomes where B happens and asking, "What proportion of those also include A?"

21
Q

What is the formula for conditional probability?

A

P(A | B) = P(A∩B)/P(B)
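A minimal pandas sketch of the formula, assuming a hypothetical DataFrame where event B is "shown_ad is True" and event A is "clicked is True":

import pandas as pd
df = pd.DataFrame({'shown_ad': [True, True, True, False],
                   'clicked':  [True, False, True, False]})
p_b = df['shown_ad'].mean()                          # P(B) = 0.75
p_a_and_b = (df['shown_ad'] & df['clicked']).mean()  # P(A∩B) = 0.5
print(p_a_and_b / p_b)                               # P(A | B) ≈ 0.67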

22
Q

Visualize how to find the count for unique values in a column?

A

Filter for the unique values in the column, then count them:

query_df = df.query('column == "data" or column == "data2"')
# Count the unique values
unique_column_value_count = query_df['column'].nunique()

23
Q

Visualize how to convert the column names to lowercase

A

# Lowercase the column names themselves
df.columns = df.columns.str.lower()
# Lowercase the values within a single column
df['column'].str.lower()

24
Q

Explain what each part of this code means

df.groupby('veh_class').agg(mean_cmb_mpg=('cmb_mpg', 'mean')).reset_index()

A

df.groupby('veh_class') groups the rows by the veh_class column.
mean_cmb_mpg is the name of the new column that will hold the mean values.
('cmb_mpg', 'mean') specifies that you want to calculate the mean of the cmb_mpg column.
.reset_index() resets the index of the new dataframe that is created.

25
Q

T or F: you can use .agg() with NaNs

A

True, technically: aggregations such as 'mean' skip NaN values by default (skipna=True). It is still good practice to handle missing values explicitly with .dropna() or .fillna() before calculating the mean, so you know exactly what the result describes.

26
Q

Visualize how to calculate an increase for a column

A

# Calculate the mean for both years
veh_08 = df_08.groupby('veh_class').agg(mean_cmb_mpg_08=('cmb_mpg', 'mean')).reset_index()
veh_18 = df_18.groupby('veh_class').agg(mean_cmb_mpg_18=('cmb_mpg', 'mean')).reset_index()

# Now, let's merge the two DataFrames on 'veh_class' to compare them
inc = pd.merge(veh_08, veh_18, on='veh_class')

# Calculate the increase in fuel economy
inc['increase'] = inc['mean_cmb_mpg_18'] - inc['mean_cmb_mpg_08']

# Display
inc

27
Q

Explain what this code is doing:

pd.DataFrame({"year": ["2008", "2018"],
              "model_num": [smart_08["cmb_mpg"].mean(), smart_18["cmb_mpg"].mean()]})

A

pd.DataFrame(...) creates a new DataFrame from a dictionary where:
"year" corresponds to a list of years, ["2008", "2018"].
"model_num" corresponds to the average combined MPG for SmartWay vehicles in 2008 and 2018.
smart_08["cmb_mpg"].mean() calculates the mean of the cmb_mpg column in the smart_08 DataFrame.
smart_18["cmb_mpg"].mean() similarly calculates the mean of the cmb_mpg column in the smart_18 DataFrame.

28
Q

Visualize how to convert "1970-01-01 00:00:00.000000162" to a regular datetime

A

# Sample DataFrame with runtime stored as a datetime-like string
data = {'runtime': ['1970-01-01 00:00:00.000000162',
                    '1970-01-01 00:00:00.000000120',
                    '1970-01-01 00:00:00.00000090']}
df_m = pd.DataFrame(data)

# Convert 'runtime' to datetime
df_m['runtime'] = pd.to_datetime(df_m['runtime'])

# Function to convert a datetime to "Xh Ym" format
def format_runtime(dt):
    total_minutes = int(dt.timestamp() // 60)  # total seconds -> whole minutes
    hours = total_minutes // 60                # hours
    minutes = total_minutes % 60               # remaining minutes
    return f"{hours}h {minutes}m"

# Apply the function to the 'runtime' column
df_m['runtime'] = df_m['runtime'].apply(format_runtime)

29
Q

Visualize how to extract the genre name and id from a list of dictionaries stored in a column, e.g. [{"id": 28, "name": "Action"}]

A

import ast

# Create a new DataFrame from the genres column
df_genre = df_m['genres']
df_g = pd.DataFrame({'genres': df_genre})

# Convert the string representation of lists to actual lists
df_g['genres'] = df_g['genres'].apply(ast.literal_eval)

# Extract the first genre's dictionary
df_g['first_genre'] = df_g['genres'].apply(lambda x: x[0] if len(x) > 0 else None)

# Create separate columns for the genre name and id
df_g['first_genre_name'] = df_g['first_genre'].apply(lambda x: x['name'] if x is not None else None)
df_g['first_genre_id'] = df_g['first_genre'].apply(lambda x: x['id'] if x is not None else None)

# Display the resulting DataFrame
print(df_g[['first_genre_name', 'first_genre_id']])

30
Q

Visualize how to re-order the columns in a dataframe

A

Create a list with the desired column order, then use that list to index the DataFrame or pass it to the reindex() method. This lets you change the order of columns as needed.

new_order = ['col_b', 'col_a', 'col_c']  # placeholder column names in the desired order
df = df.reindex(columns=new_order)
# or equivalently: df = df[new_order]

31
Q

What is the ast.literal_eval() function?

A

Part of the ast (Abstract Syntax Trees) module. You can use it to safely convert a string to the Python object it represents, such as a list of dictionaries that you can then work with in your code.

# Import the library
import ast

# String representation of a list of dictionaries
data_string = "[{'name': 'Action', 'id': 28}, {'name': 'Adventure', 'id': 12}]"
data_list = ast.literal_eval(data_string)
print(data_list)
# Output: [{'name': 'Action', 'id': 28}, {'name': 'Adventure', 'id': 12}]

32
Q

How can you generate a random array with 25 rows and 5 columns? Note: make sure it rounds to 2 decimal points.

A

import numpy as np
np.random.rand(25, 5).round(decimals=2)

33
Q

.tolist()

A

Converts a NumPy array or pandas Series into a Python list.

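A tiny sketch (made-up Series) showing the conversion:

import pandas as pd
s = pd.Series([1, 2, 3])
print(s.tolist())         # [1, 2, 3] - a plain Python list
print(s.values.tolist())  # same result, via the underlying NumPy array
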
34
Q

.eval()

A

Converts the string representation of a list into a Python list (ast.literal_eval() is the safer choice for untrusted data).

df['col1'] = "[0.53, 0.59, 0.45]"
df['col1'] = df['col1'].apply(lambda u: eval(u))
# df['col1'] now holds the list [0.53, 0.59, 0.45]

35
Q

Visualize how to explode a list into rows

A

df_explode = df.explode(column="column_name")

36
Q

Visualize how to explode data into multiple columns

A

df_new = pd.DataFrame(df_explode["column_name"].tolist(), columns=["col1", "col2", "col3"])

37
Q

What chart displays data in bins and shows the frequency of observations in each bin?

A

A histogram

df.hist() or df['col_name'].hist()

38
Q

What chart is good for categorical data?

A

Bar charts

df.plot(kind='bar') or df['col'].plot(kind='bar')

39
Q

What chart shows the relationship between two numerical variables on a two-dimensional graph?

A

Scatter plot

df.plot(x='col1', y='col2', kind='scatter') or pd.plotting.scatter_matrix(df)

40
Q

What chart displays the distribution of numerical data as min, 1st quartile, median, 3rd quartile, and max?

A

Box plot

df.plot(kind='box') or df['col'].plot(kind='box')

41
Q

Visualize how to plot using the kind parameter

A

df.plot(x="col1", y="col2", kind=" ")  # e.g. kind='line', 'bar', 'hist', 'box', or 'scatter'

42
Q

Visualize how to retrieve the lowest value of column 1 and find the corresponding value in column 2

A

Use .idxmin():

# Find the index of the minimum for column 1
lowest_value_index = df['column 1'].idxmin()
# Access the corresponding value in column 2 using that index
lowest_value = df.loc[lowest_value_index, 'column 2']

43
Q

What function returns the index for the minimum value of a column?

A

.idxmin()

44
Q

What parameter allows you to set the size of the figure or plot in Pandas?

A

figsize

df.plot(figsize=(width, height))

45
Q

What allows you to set the title for your Pandas plot?

A

df.plot(title='Title of plot')

46
Q

How can you customize the labels of the x-axis and y-axis?

A

df.plot(xlabel='X axis label')
df.plot(ylabel='Y axis label')
df.plot(ylim=(min_value, max_value))  # sets the min and max values for the y-axis

47
Q

How do you change the color of the Pandas plot?

A

df.plot(color='color_name')

48
Q

How do you provide the legend in a pandas plot?

A

df.plot(legend=True)

49
Q

Visualize the plotting of a bar chart in order

A

x = df['column'].value_counts().index
df['column'].value_counts()[x].plot(kind='bar');

50
Q

Define .value_counts()

A

Counts the unique values in a Series and returns a new Series containing the counts of each unique value.

51
Q

What would occur if you used
color = ['blue', 'red', 'blue', 'blue', 'red', 'green']
df['color'].value_counts()

A

color | count
blue  | 3
red   | 2
green | 1

52
Q

What would occur if you used
color = ['blue', 'red', 'blue', 'blue', 'red', 'green']
value_counts = df['color'].value_counts()
print(value_counts.index)
print(value_counts.values)

A

Index(['blue', 'red', 'green'], dtype='object')
[3 2 1]

53
Q

Visualize how to convert the data from .values to a Python list

A

values_numpy = df['column'].values
python_values = values_numpy.tolist()
print(python_values)

54
Q

How do you filter rows based on multiple conditions?

A

Using the bitwise OR operator |

x = df[(df['column'] == 'data') | (df['column'] == 'data2')]

55
Q

Visualize how to use .isin() to verify that a value is in a list of options

A

df[df['column'].isin(['value_1', 'value_2'])]

56
Q

What should you do before calculating the mean on a column?

A

Check for NaN values and drop them:

df['column'].isnull().sum()
df.dropna(subset=['column'])

57
Q

Visualize how to calculate the means of two columns for years 2024 and 2025 from two different dataframes (df_24['col_2024'] and df_25['col_2025']) so that you can calculate the increase

A

# Calculate the mean for both years
value_24 = df_24.groupby('data').agg(mean_col_2024=('col_2024', 'mean')).reset_index()
value_25 = df_25.groupby('data').agg(mean_col_2025=('col_2025', 'mean')).reset_index()

# Merge the two dataframes on column 'data'
merged_dfs = pd.merge(value_24, value_25, on='data')

# Calculate the increase from 2024 to 2025
merged_dfs['increase'] = merged_dfs['mean_col_2025'] - merged_dfs['mean_col_2024']

# Display results
merged_dfs

58
Q

What does this code do?
df.label.unique()

A

df refers to the dataframe.
.label accesses the column named label within the df dataframe.
.unique() returns an array of the unique values present in the label column.
Output: array(['no', 'yes'], dtype=object)

You can then use the labels to filter:
df_filter = df.query('label == "yes"')

59
Q

When can you use data.column or "dot notation"? e.g. df.column.value_counts()

A

Only when the column name is a valid Python identifier: no spaces or special characters, and not starting with a number.
The name also must not conflict with an existing DataFrame attribute or method, such as columns or values.
For dynamic column access (where the name is stored in a variable), you must use bracket notation instead:
col_name = 'age'
print(df[col_name])

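A small sketch (hypothetical DataFrame) contrasting when dot notation works and when bracket notation is required:

import pandas as pd
df = pd.DataFrame({'age': [25, 30], 'first name': ['Ana', 'Bo']})
print(df.age)            # dot notation works: valid identifier, no attribute conflict
print(df['first name'])  # bracket notation required: the name contains a space
col_name = 'age'
print(df[col_name])      # bracket notation required for dynamic access
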
60
Q

How can I convert a date and time string to a datetime datatype?

A

pd.to_datetime()

61
Q

Visualize using a function to convert a datetime column to an "Xh Ym" hours-and-minutes string

A

# Sample DataFrame with runtime as a datetime-like string
data = {'runtime': ['1970-01-01 00:00:00.000000162',
                    '1970-01-01 00:00:00.000000120',
                    '1970-01-01 00:00:00.00000090']}
df_m = pd.DataFrame(data)

# Convert 'runtime' to datetime
df_m['runtime'] = pd.to_datetime(df_m['runtime'])

# Function to convert a datetime to "Xh Ym" format
def format_runtime(dt):
    total_minutes = int(dt.timestamp() // 60)  # total seconds -> whole minutes
    hours = total_minutes // 60                # hours
    minutes = total_minutes % 60               # remaining minutes
    return f"{hours}h {minutes}m"

# Apply the function to the 'runtime' column
df_m['runtime'] = df_m['runtime'].apply(format_runtime)

62
Q

Visualize the steps to break up the list below into new columns, from a dataframe titled df_m with a column labeled 'genres':
[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]

A

import pandas as pd
import ast

# Assuming df_m is your original DataFrame
# Create a new DataFrame to clean up the genres column
df_genre = df_m['genres']
df_g = pd.DataFrame({'genres': df_genre})

# Convert the string representation of lists into actual lists
df_g['genres'] = df_g['genres'].apply(ast.literal_eval)

# Explode the genres column to get each genre in a separate row
df_g = df_g.explode('genres')

# Extract the genre's name and id into separate columns
df_g['genre_name'] = df_g['genres'].apply(lambda x: x['name'] if isinstance(x, dict) else None)
df_g['genre_id'] = df_g['genres'].apply(lambda x: x['id'] if isinstance(x, dict) else None)

# Drop the original 'genres' column if you no longer need it
df_g = df_g.drop(columns=['genres'])

# Display the resulting DataFrame with genre names and IDs
print(df_g[['genre_name', 'genre_id']])

63
Q

How could you keep multiple descriptors for a value that would otherwise produce duplicates, e.g. a movie with multiple genres, a person with multiple names, a car with multiple colors?

A

Aggregate the information. If you want to summarize instead of listing all combinations, you can aggregate the data, for example by concatenating the genre names into a single string for each movie:

df_g['genres'] = df_g.groupby('Movie Title')['genre_name'].transform(lambda x: ', '.join(x))
df_g = df_g.drop_duplicates(subset=['Movie Title'])  # Keep only one row per movie

64
Q

Visualize how to drop a row with a specific index

A

dataframe.drop(index=###, errors='ignore')

65
Q

Visualize how to load sqlite3 and read a database into pandas

A

import sqlite3

conn = sqlite3.connect('file_path_to_database')
c = conn.cursor()
data = pd.read_sql("SQL query", conn)

66
Q

Visualize how to pass SQL statements using sqlite3

A

var = pd.read_sql("SQL query", sqlite3.connect('file_path_to_database'))