Exploring data Flashcards

Learning how to explore data after you clean it.

1
Q

Visualize how to get the variance of a dataframe using groupby

A

df.groupby(by="col1")[["col2", "col3", "col4"]].var()

2
Q

Visualize how to use .describe() on groups to get specific percentiles via the percentiles parameter.

A

df.groupby(by="col1")[["col2", "col3", "col4"]].describe(percentiles=[0.25, 0.5, 0.75])

3
Q

What is a histogram? Visualize how to create one.

A

df.plot(kind="hist")

It displays the distribution of numerical data
It divides data into bins and shows frequency of observations in each bin

4
Q

Pandas: Visualize how to create a bar chart

A

df.plot(kind="bar")

It compares different categories and shows values as bars of various lengths.

5
Q

matplotlib: Visualize how to create a pie chart

A

import matplotlib.pyplot as plt

labels = "L1", "L2", "L3"
sizes = [10, 20, 25]

fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct='%1.1f%%', pctdistance=1.25, labeldistance=0.6, colors=["C1", "C2", "C3"])

Use pctdistance and labeldistance if you want the percentages outside of the pie (and the labels pulled inward).

6
Q

A. Visualize how you can use .agg() on all columns
B. Visualize how you can use .agg() on specific columns
C. Visualize how you can use .agg() with .groupby()
D. Visualize how to rename columns with .agg()

A

A. df.agg(['mean', 'sum', 'max'])
B. df.agg({'col1': 'mean', 'col2': ['sum', 'min'], 'col3': lambda x: x.std()})
C. df.groupby('col_group').agg({'col1': 'mean', 'col2': 'sum', 'col3': 'max'})
D. df.groupby('group_column').agg(mean_col1=('col1', 'mean'), sum_col2=('col2', 'sum'))

7
Q

Visualize how to reset the index of a DataFrame

A

df.reset_index()
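
For example, a groupby aggregation keeps the group labels in the index; .reset_index() turns them back into a regular column. A minimal sketch, assuming a hypothetical 'model' column:
mean_mpg = df.groupby('model')['city_mpg'].mean()
mean_mpg = mean_mpg.reset_index()  # 'model' becomes a regular column again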

8
Q

Visualize an example of how to use groupby to calculate mean

A

import pandas as pd

data = {'model': ['Car A', 'Car A', 'Car B', 'Car B', 'Car C'], 'city_mpg': [20, 22, 25, 27, 18]}
df = pd.DataFrame(data)
mean_mpg = df.groupby('model')['city_mpg'].mean()
print(mean_mpg)

Output:
model
Car A    21.0
Car B    26.0
Car C    18.0
Name: city_mpg, dtype: float64

9
Q

A. Visualize how to calculate the mean on a dataframe.
B. Visualize how to calculate the mean on a column

A

A. df.mean(numeric_only=True)
B. df["col"].mean() (or df.groupby("col").mean(numeric_only=True) for the mean of each group)

10
Q

What does standard deviation measure?

A

How much each point differs from the mean, or how spread out the data is.

.std()
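
For example, a minimal sketch assuming a hypothetical numeric column named 'sales':
df['sales'].std()  # standard deviation of one column
df.std(numeric_only=True)  # standard deviation of every numeric column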

11
Q

What is variance?

A

Variance helps us understand how the numbers in a group differ from the average, giving a sense of how scattered or clustered the data is.

.var()
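
For example, a minimal sketch assuming a hypothetical numeric column named 'sales':
df['sales'].var()  # variance of one column
df.var(numeric_only=True)  # variance of every numeric column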

12
Q

What are quantiles? And how do you use them?

A

Quantiles are values that split a group of data into equal parts.

df[['col1', 'col2', 'col3']].quantile(q=[0.25, 0.50, 0.75, 1])

You can pass any list of quantile values between 0 and 1.

13
Q

What method would you use to show the total capital gains and capital losses?

A

dataframe[["capital-gain", "capital-loss"]].sum()

14
Q

What are the different pandas plotting methods?

A

A. df.hist(figsize=(#,#)); or df[col].hist(figsize=(8,8));
B. df.plot(kind="box", figsize=(#,#)) or df[col].plot(kind="box", figsize=(#,#))
C. df.plot.bar() or df[col].plot.bar()
D. df.plot.pie(y="col") or df[col].plot.pie()
E. pd.plotting.scatter_matrix(df)
F. df.plot.scatter(x="col1", y="col2")
G. df.plot.box() or df[col].plot.box()

15
Q

How would you plot a bar chart with the value counts?

A

df['col'].value_counts().plot(kind='bar');

16
Q

Using df.plot(kind='scatter'), visualize how to specify which axis each column will be plotted on.

A

df.plot(x='col1', y='col2', kind='scatter');

17
Q

A. Visualize how to use .hist() to return a matplotlib subplot, change its transparency, and change the figure size.
B. Visualize how to layer a new histogram onto the same subplot that was returned.
C. Visualize how to add a title, axis labels, and a legend to that subplot.

A

A. ax = df_A['col1'].hist(alpha=0.5, figsize=(#,#), label='title1');
B. df_B['col1'].hist(alpha=0.5, figsize=(#,#), label='title2', ax=ax);
C.
ax.set_title('TITLE');
ax.set_xlabel('X-AXIS TITLE');
ax.set_ylabel('Y-AXIS TITLE');
ax.legend(loc='upper right');

18
Q

If you had a column containing the sales for each week for an entity, how would you find the row corresponding to the minimum sales or the worst week?

A

.idxmin()

ex:
# Step 1: find the index of the minimum sales
worst_week_index = df['col'].idxmin()
# Step 2: access the corresponding week
worst_week = df.loc[worst_week_index, 'week']
# Print the result
print(f"The worst week is: {worst_week}")

19
Q

A. Visualize how to customize the x-axis and the y-axis of your Pandas plots.
B. Visualize how to set the minimum and maximum values of your Pandas plot

A

A.
df.plot(xlabel='X Axis Label')
df.plot(ylabel='Y Axis Label')

B.
df.plot(ylim=(min_value, max_value))

20
Q

Visualize how to set the color for your Pandas plot

A

df.plot(color='color_name')

You can use color names or hexadecimal color codes.

21
Q

Visualize how to set the legend for your Pandas plot

A

df.plot(legend=True)

22
Q

How can you use .index with .value_counts()?

A

You can use .index with .value_counts() to access the unique values in a column; the counts themselves are the values of the resulting Series.

EX.
# Example DataFrame
data = {'fruits': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']}
df = pd.DataFrame(data)
# Calculate value counts and grab the index
counts_index = df['fruits'].value_counts().index
# Print
print(counts_index)
# Output
Index(['apple', 'banana', 'orange'], dtype='object')

23
Q

What does the .index do?

A

The .index attribute returns the labels of a Series or DataFrame. On a .value_counts() result, those labels are the unique values (the categories).
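
A small sketch, reusing the hypothetical 'fruits' column from the card above:
df['fruits'].value_counts().index  # Index(['apple', 'banana', 'orange'], dtype='object')
df.index  # the row labels of the DataFrame itself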

24
Q

What does .values do?

A

The .values attribute returns the underlying data as a NumPy array, including duplicates, in order of appearance.
This is useful for numerical computations, array manipulations, or converting to a Python list, e.g. df['col'].values.tolist()

25
Q

What is the difference between a pandas Series and a NumPy array?

A

A pandas Series is designed for labeled data, which makes it easier to work with tabular data, and it is built on top of NumPy arrays. A NumPy array is designed for numerical computations and linear algebra operations and has no labels or metadata.

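A small sketch illustrating the difference, using hypothetical values:
import numpy as np
import pandas as pd

s = pd.Series([20, 25, 18], index=['Car A', 'Car B', 'Car C'])  # labeled data
a = np.array([20, 25, 18])  # positions only, no labels
s.loc['Car B']  # select by label -> 25
a[1]  # select by position -> 25
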
26
Q

What do you do when you encounter a column that contains rows with lists?

A

Use the .explode() method to distribute the values across multiple rows in a column:
df.explode(column="column_name")

27
Q

What do you need to do if you want to use .explode() and then distribute the values across multiple columns?

A

# First flatten the list into individual rows in a column
df_explode = df.explode(column="column_name")
# Then split the data into multiple columns
df_new = pd.DataFrame(df_explode["column_name"].tolist(), columns=["col1", "col2", "col3"])

28
Q

What does .tolist() do?

A

It converts a pandas Series or a DataFrame column into a Python list.

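For example, a minimal sketch assuming a hypothetical column 'col':
df['col'].tolist()  # Series -> Python list
df['col'].values.tolist()  # same result via the underlying NumPy array
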
29
Q

Explain what pd.DataFrame({"list_values": [str(a) for a in array.tolist()]}) does

A

- array.tolist() converts the NumPy array into a Python list
- [str(a) for a in array.tolist()] iterates over each list a in that list and converts each list a into its string representation
- pd.DataFrame() turns the result into a column named list_values, where each entry is the string representation of a list from the original NumPy array

30
Q

Explain what pd.DataFrame(dataframe.column.apply(lambda u: eval(u)).values.tolist()) accomplishes on a column that contains string representations of lists

A

- dataframe.column is the column we are accessing that contains the string lists
- .apply(lambda u: eval(u)) applies the lambda function u: eval(u) to dataframe.column
- eval() converts each string representation of a list back into a Python list
- .values retrieves the underlying data as a NumPy array, which will now be a 1D array of lists
- .tolist() converts the NumPy array of lists into a Python list of lists
- pd.DataFrame() creates a new DataFrame from the list of lists, with each inner list becoming a row in the new DataFrame

31
Q

If I wanted to take a column containing rows of string representations of lists and convert it back into actual lists, what code would I use?

A

pd.DataFrame(dataframe.column.apply(lambda u: eval(u)).values.tolist())

32
Q

np.where()

A

np.where(condition, value_if_true, value_if_false)

It goes through each element in your condition. If the condition is True, it puts the first value. If the condition is False, it puts the second value.

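A minimal sketch, assuming a hypothetical numeric 'score' column:
import numpy as np

df['result'] = np.where(df['score'] >= 60, 'pass', 'fail')  # 'pass' where the condition is True, 'fail' otherwise
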
33
Q

How can you use np.where() to trap divide-by-zero?

A

np.where() acts as a smart if-else for arrays:

df["new_column"] = np.where(
    df["denominator_column"] != 0,                        # condition
    df["numerator_column"] / df["denominator_column"],    # value if true
    np.nan                                                # value if false
)

34
Q

How can you use .apply() to trap divide-by-zero?

A

def safe_divide(row):
    if row["total_revenue"] != 0:
        return row["total_long-term_debt"] / row["total_revenue"]
    else:
        return np.nan  # or 0 or any other fallback value

dfc["debt-to-income"] = dfc.apply(safe_divide, axis=1)

35
Q

What does .apply() do?

A

It applies a function along an axis of a DataFrame (each row or column), or to each value of a Series.

36
Q

How do I apply a function row-by-row in a DataFrame?

A

Use .apply(my_function, axis=1) where my_function accepts a row.

df["result"] = df.apply(lambda row: row["a"] + row["b"], axis=1)

37
Q

How do I create a column that flags businesses with revenue below 100,000 using .apply()?

A

Use a lambda function with .apply():

df["low_revenue"] = df["total_revenue"].apply(lambda x: 1 if x < 100000 else 0)

38
Q

When should you use .apply() over np.where()?

A

Use .apply() for custom or row-based logic; use np.where() for simple column-wise conditions.

39
Q

How do I safely create a column from dividing two other columns?

A

df["ratio"] = np.where(df["total_revenue"] != 0, df["debt"] / df["total_revenue"], np.nan)

40
Q

How can I flag high profit margin businesses with np.where()?

A

df["high_profit_flag"] = np.where(df["profit_margin"] > 0.25, 1, 0)

41
Q

How do I flag rows that meet multiple conditions in pandas?

A

df["flag"] = np.where((df["debt_to_equity"] > 1) & (df["profit_margin"] < 0), 1, 0)

42
Q

How do I create a column that labels a company as "risky" if its debt-to-equity ratio is above 1.5?

A

df["risk_label"] = np.where(df["debt_to_equity"] > 1.5, "risky", "safe")

43
Q

vectorization

A

The process of applying an operation over an entire array at once instead of looping over individual elements in Python, for example whole-column arithmetic or np.where(). (.apply() with a Python function runs element by element under the hood, so it is generally slower than truly vectorized operations.)

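For example, both lines below add 5 to every value in a column; the first is vectorized column arithmetic, the second calls a Python function element by element (the student_data column is borrowed from the next card):
student_data['G1'] + 5  # vectorized column arithmetic
student_data['G1'].apply(lambda x: x + 5)  # element-by-element with a Python function
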
44
Q

Convert this function into a lambda function:

def add5(x):
    return x + 5

df = student_data[['G1', 'G2', 'G3']].apply(add5)

A

df = student_data[['G1', 'G2', 'G3']].apply(lambda x: x + 5)

45
Q

.eq()

A

df['col'].eq(num) returns a Boolean Series that is True wherever the column's values equal num.

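For example, .eq() can build a Boolean mask for filtering; a minimal sketch assuming a hypothetical 'grade' column:
mask = df['grade'].eq(10)  # Boolean Series, True where grade == 10
df[mask]  # keep only the rows whose grade equals 10
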
46
Q

T or F: The below code is correct:

if dfc["total_revenue"] != 0:
    dfc["debt_to_income"] = dfc.apply(dfc["total_long-term_debt"] / dfc["total_revenue"])
else:
    np.nan

A

False. .apply() expects a function, and using if on a whole Series is ambiguous. Instead:

df["new_col"] = df.apply(
    lambda row: row["num"] / row["den"] if row["den"] != 0 else np.nan,
    axis=1
)

47
Q

How do I compare two Series row-by-row and filter based on a greater-than condition in pandas?

A

Use .gt() with axis=0 to compare each row individually:

filtered_df = df1[df1['col1'].gt(df2['col1'], axis=0)]

48
Q

How do I find where one Series is less than another, row-by-row?

A

Use .lt() with axis=0 for row-wise comparison:

filtered_df = df1[df1['col1'].lt(df2['col1'], axis=0)]