Data analysis code Flashcards

(152 cards)

1
Q

What does axis=1 do in a DataFrame operation?

A

It applies the operation horizontally, across the columns of each row (e.g., df.sum(axis=1) returns one sum per row).

2
Q

What does axis=0 do in a DataFrame operation?

A

It applies the operation vertically, down each column (e.g., df.sum(axis=0) returns one sum per column).
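A minimal sketch of the difference, on a toy DataFrame with made-up values:

import pandas as pd

# Toy DataFrame purely for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

print(df.sum(axis=0))  # down each column:  a -> 3,  b -> 30
print(df.sum(axis=1))  # across each row:   0 -> 11, 1 -> 22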

3
Q

What is the purpose of the Shapiro-Wilk test?

A

It tests whether the data is normally distributed.

4
Q

What is the interpretation of a p-value greater than 0.05 in the Shapiro-Wilk test?

A

We fail to reject the null hypothesis; the data are consistent with a normal distribution (a high p-value is not proof of normality).
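A minimal sketch with SciPy, using simulated data for illustration:

import numpy as np
from scipy.stats import shapiro

data = np.random.normal(loc=0, scale=1, size=100)  # simulated sample
stat, p = shapiro(data)
print(f'W = {stat:.3f}, p = {p:.3f}')
if p > 0.05:
    print('Fail to reject H0: sample is consistent with a normal distribution')
else:
    print('Reject H0: sample deviates from normality')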

5
Q

What does df.shape return?

A

A tuple with the number of rows and columns: (number of rows, number of columns).

6
Q

What statistics does df.describe() provide?

A

It provides summary statistics: Count, Mean, Standard Deviation, Minimum, Quartiles (25%, 50%, 75%), and Maximum.

7
Q

What does df.mean() do in a DataFrame?

A

It computes the mean of each column.

8
Q

How do you access a specific column in a DataFrame?

A

Use the column name: df['column_name'].

9
Q

What does the code df['column_name'].max() do?

A

It returns the maximum value in the specified column.

10
Q

What assumptions does the paired t-test make?

A

It assumes no major outliers (in the differences), independent observations (pairs), a continuous dependent variable, and approximately normally distributed paired differences.
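A minimal sketch using scipy.stats.ttest_rel; the 'before'/'after' columns and values are hypothetical:

import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical paired measurements on the same subjects
df = pd.DataFrame({'before': [4.1, 3.9, 5.0, 4.4, 4.8],
                   'after':  [4.6, 4.2, 5.3, 4.9, 5.1]})

result = ttest_rel(df['before'], df['after'])
print(f't = {result.statistic:.2f}, p = {result.pvalue:.3f}')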

11
Q

How do you read a CSV file into a DataFrame in Python?

A

df = pd.read_csv('../folder/name.filetype')

12
Q

How do you read an Excel file into a DataFrame in Python?

A

df = pd.read_excel('file_path')

13
Q

How do you read a tab-separated file into a DataFrame?

A

Use the sep='\t' parameter in pd.read_csv().

14
Q

How do you handle missing data when reading a file?

A

Use the na_values parameter to specify which strings (e.g., '' or 'NA') should be read in as NaN.

15
Q

How do you specify the data type for integer columns in a DataFrame?

A

Use dtype=pd.Int64Dtype() (the nullable integer type) so the column is read as integers rather than floats, even when it contains missing values.

16
Q

How do you rename columns when reading a file into a DataFrame?

A

Use header=None, names=['column1', 'column2', ...] when reading the file.

17
Q

How do you skip rows from the top when reading a file?

A

Use the skiprows=… parameter.

18
Q

How do you skip rows from the bottom when reading a file?

A

Use the skipfooter=… parameter.

19
Q

How do you set a specific column as the index when reading a file?

A

Use index_col=1 to set the second column as the index
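A sketch combining several of the pd.read_csv() options from the cards above; the path and column names are placeholders:

import pandas as pd

df = pd.read_csv(
    '../data/measurements.csv',          # hypothetical path
    sep=',',                             # sep='\t' for tab-separated files
    na_values=['', 'NA'],                # strings to read as NaN
    header=None,
    names=['site', 'year', 'rainfall'],  # rename columns on read
    skiprows=2,                          # skip metadata rows at the top
    index_col=0,                         # first column becomes the index
    dtype={'year': pd.Int64Dtype()},     # nullable integer column
)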

20
Q

How do you update a specific value in a DataFrame?

A

Use df.at['row_name', 'column_name'] = new_value.

21
Q

What does df.info() display?

A

It displays the number of entries, the index range, the column names, the non-null count per column, and each column's data type.

22
Q

What does the data type float64 represent?

A

It represents decimal numbers.

23
Q

What does the data type int64 represent?

A

It represents whole numbers (integers).

24
Q

What does the data type object represent?

A

It typically represents strings (text); pandas also uses it for columns with mixed types.

25
How do you display the last 5 rows of a DataFrame?
Use df.tail().
26
How do you sort the values in a DataFrame by a specific column?
Use df.sort_values(by=['column_name'], ascending=False).
27
How do you drop rows with missing values from a DataFrame?
Use df.dropna(inplace=True).
28
What is the effect of using inplace=True in df.dropna()?
It modifies the original DataFrame.
29
What is the effect of using inplace=False in df.dropna()?
It creates a new DataFrame and leaves the original unchanged.
30
How do you drop columns with missing values from a DataFrame?
Use df.dropna(axis='columns', inplace=True)
31
How do you drop rows that have less than 2 non-missing values?
Use df.dropna(thresh=2, inplace=True).
32
How do you drop rows based on missing values in specific columns?
Use df.dropna(subset=['column1', 'column2'], inplace=True).
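A small sketch of the dropna() variants from cards 27-32, on a toy DataFrame with made-up values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': ['x', 'y', np.nan]})

df.dropna()                   # drop rows containing any NaN
df.dropna(axis='columns')     # drop columns containing any NaN
df.dropna(thresh=2)           # keep rows with at least 2 non-missing values
df.dropna(subset=['a', 'c'])  # only consider columns 'a' and 'c'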
33
How do you save a DataFrame as a CSV file?
Use df.to_csv('data_name.filetype', index=False).
34
How do you save a DataFrame as an Excel file?
Use df.to_excel('data_name.filetype', index=False).
35
How do you exclude the index when saving a DataFrame to a file?
Use index=False in the to_csv() or to_excel() methods.
36
How do you add a new column to a DataFrame?
df['Column name'] = ['column contents', 'next value', ...]
37
How do you forward-fill NaN values in a column?
df['Column name'] = df['Column name'].fillna(method="ffill") (or, in newer pandas, df['Column name'].ffill())
38
How do you remove a single column from a DataFrame?
df = df.drop(columns='column name')
39
How do you remove multiple columns from a DataFrame?
df = df.drop(columns=['name 1', 'name 2'])
40
How do you remove a specific row by its index?
df = df.drop(row_label), where row_label is the row's index label.
41
How do you rearrange columns into a custom order?
cols = df.columns.tolist(); cols_new = [cols[1], cols[3], cols[2], cols[0]]; df = df[cols_new]
42
How do you sort the columns alphabetically?
cols_new = sorted(cols) then df = df[cols_new]
43
How do you sort the columns in reverse alphabetical order?
cols_new = sorted(cols, reverse=True) then df = df[cols_new]
44
How do you sort values by multiple columns in descending order?
df = df.sort_values(by=[col1, col2], ascending=False)
45
How do you create a new column by multiplying an existing column by a scalar?
df['new column'] = df['old column'] * 50
46
How do you apply a logarithmic transformation to a column?
df['new column'] = np.log(df['old column'])
47
How do you apply a square root transformation to a column?
df['new column'] = np.sqrt(df['old column'])
48
How do you concatenate two string columns in a DataFrame?
df['Soil_Drainage'] = df['Soil'] + '_' + df['Drainage']
49
How do you convert numerical data to strings before concatenating?
Use .astype(str) to convert the numeric column to strings, e.g. df['new_column'] = df['text col'] + '_' + df['numeric col'].astype(str)
50
How do you split a column into multiple columns based on a delimiter?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', expand=True)
51
How do you limit the number of splits to just once when splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', n=1, expand=True)
52
How do you use a specific substring for splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_20', expand=True)
53
How do you remove duplicate rows from a DataFrame?
data = data.drop_duplicates()
54
How do you reset the index after dropping duplicates from a DataFrame?
data = data.drop_duplicates().reset_index(drop=True)
55
How do you transpose a DataFrame to swap rows and columns?
df = df.T (This makes the index the column headers and the column headers the index).
56
How do you reset the index and move it to a column in a DataFrame?
Use df.reset_index(inplace=True); the old index becomes a regular column (named 'index' if it was unnamed). If you do not want to keep it, remove it with df = df.drop(columns='index').
57
How do you select a range of rows using iloc in a DataFrame?
df.iloc[1:2] selects rows from position 1 up to but not including position 2.
58
How do you check the data types of each column in a DataFrame?
Use df.dtypes to check the data types of each column.
59
How do you change the data type of a column in a DataFrame?
Use df['column_name'] = df['column_name'].astype(new_type) to cast a column to a different type.
60
What is the purpose of melting a DataFrame, and how do you do it?
Melting transforms data from wide form into long form, which is useful for certain types of analysis and visualization. Use pd.melt(dataframe, id_vars, value_vars, var_name='new_var_name', value_name='new_value_name').
61
What parameters do you need to specify when melting a DataFrame?
id_vars: Columns to keep as is (typically identifiers). value_vars: Columns containing the values to melt. var_name: The name for the new variable column. value_name: The name for the new value column.
62
What is the benefit of melting a DataFrame?
Melting makes the dataset more suitable for graphing and further analysis, especially for visualizing or performing operations on variables in a consistent, long format.
63
What is casting (pivoting) in DataFrame manipulation, and how do you do it?
Pivoting (casting) reverses melting, converting a long-form DataFrame into a wide-form DataFrame by spreading values into new columns. Use pd.pivot(dataframe, columns='column header', index=['index header', ...], values='data column').
64
What parameters do you need to specify when pivoting a DataFrame?
columns: The column whose unique values become the new column headers. index: The column(s) that become the index of the DataFrame. values: The column containing the data to populate the new table.
65
How do you reset the index after pivoting a DataFrame?
Use df_cast.reset_index(inplace=True) to convert any multi-level index back into regular columns.
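A round-trip sketch of melting and then pivoting; the column names and values are invented for illustration:

import pandas as pd

wide = pd.DataFrame({'plot': ['p1', 'p2'],
                     '2020': [3.1, 2.8],
                     '2021': [3.4, 3.0]})

# Wide -> long
long_df = pd.melt(wide, id_vars=['plot'], value_vars=['2020', '2021'],
                  var_name='year', value_name='yield')

# Long -> wide (reverses the melt)
wide_again = pd.pivot(long_df, columns='year', index='plot', values='yield')
wide_again.reset_index(inplace=True)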
66
How do you create a relational plot to show the relationship between two numerical variables?
Use sns.relplot(x='sepal_length', y='petal_length', data=iris).
67
How do you turn a relational plot into a line plot with a confidence interval?
Add kind='line' to the sns.relplot function. The shaded area represents the 95% confidence interval.
68
How do you differentiate points in a relational plot by a categorical variable using color?
Add hue='column_name' to the plot, where 'column_name' is the categorical variable
69
What does sns.lmplot do?
It creates a scatter plot with a linear regression line and a shaded 95% confidence interval.
70
How do you create a composite plot with both categorical and numerical variables?
Use sns.jointplot(x='sepal_length', y='petal_length', data=iris) or sns.pairplot(data=iris).
71
How do you create a heatmap in Seaborn?
Use sns.heatmap(data, annot=True, fmt=".2f", cmap="viridis", cbar=True).
72
How do you create multiple plots side by side?
Use fig, axes = plt.subplots(1, 2, figsize=(10, 5)) for two plots in one row
73
How do you create a distribution plot (histogram) for numerical variables?
Use sns.displot(x='sepal_length', data=iris). Add hue or col='column_name' for splitting based on a categorical variable.
74
How do you explore relationships between categorical variables using a categorical plot?
Use sns.catplot(x='Species', y='petal_length', data=iris). Add kind='swarm' or kind='strip' for different plot styles.
75
How do you customize violin plots to display both categories of a variable on the same plot?
Add split=True to sns.violinplot (together with a two-level hue variable) to show both categories on the same violin.
76
How do you set different Seaborn plot styles?
Use sns.set_style('style'). Options: whitegrid (white background with grid), darkgrid (grey background with grid), dark (grey background without grid), white (white background without grid).
77
How do you suppress the top text on a Seaborn plot?
Add a semicolon ; at the end of the Seaborn plotting code to suppress the text output (the returned object's description) that otherwise appears above the plot.
78
How do you set the context for a Seaborn plot?
Use sns.set_context('context'). Options: paper (smaller text, for publication), notebook (default for notebooks), talk (larger, for presentations), poster (largest, for posters).
79
How do you view and set color palettes in Seaborn?
To view a palette: sns.color_palette('palette name'). To set it: sns.set_palette('palette name') (e.g., sns.set_palette('tab10') for the default color set).
80
How do you reset Seaborn’s default settings?
Use sns.reset_defaults() to reset color and figure size to Seaborn’s default settings
81
How do you control the size of figure-level plots (e.g., relplot, displot)?
Use sns.relplot(x='', y='', data=..., hue='', height=6, aspect=1.5). height: plot height in inches; aspect: aspect ratio (width/height).
82
How do you control the size of axes-level plots (e.g., scatterplot, boxplot)?
Use plt.figure(figsize=(width, height)). Example: plt.figure(figsize=(9, 6)) for 9x6 inches.
83
How do you customize titles and axis labels for multiple plots in Seaborn?
Title: g.fig.suptitle('Title', fontsize=..., y=...). Axis labels: g.set_axis_labels('x label', 'y label', fontsize=...). Y-axis limits: g.set(ylim=(0, 8)).
84
How do you customize individual plots and legends in Seaborn and Matplotlib?
Title: plt.title('Title', fontsize=...). Axis labels: plt.xlabel('x label', fontsize=...) and plt.ylabel('y label', fontsize=...). Customize the legend: plt.legend(loc='', title='', frameon=False). Remove the legend: pass legend=False to the Seaborn call (where supported) or use ax.get_legend().remove().
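A sketch pulling these customizations together on an axes-level plot, using Seaborn's built-in iris dataset:

import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset('iris')

plt.figure(figsize=(9, 6))
ax = sns.scatterplot(x='sepal_length', y='petal_length', hue='species', data=iris)
plt.title('Petal vs. sepal length', fontsize=14)
plt.xlabel('Sepal length (cm)', fontsize=12)
plt.ylabel('Petal length (cm)', fontsize=12)
plt.legend(loc='upper left', title='Species', frameon=False)
plt.savefig('iris_scatter.png')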
85
How do you save a plot as an image?
Use plt.savefig('filename.png') to save the plot as an image (you can specify other formats like .jpg, .svg, etc.).
86
What are the four types of joins in Pandas?
Outer Join: Combines all rows from both dataframes, including non-overlapping rows. Inner Join: Includes only rows common to both dataframes. Left Join: Includes all rows from the left dataframe and matching rows from the right. Right Join: Includes all rows from the right dataframe and matching rows from the left.
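A toy example of the four join types (made-up data):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'soil': ['clay', 'loam', 'sand']})
right = pd.DataFrame({'id': [2, 3, 4], 'yield': [3.1, 2.8, 3.5]})

left.merge(right, how='inner', on='id')  # ids 2, 3 only
left.merge(right, how='outer', on='id')  # ids 1-4, NaN where unmatched
left.merge(right, how='left', on='id')   # ids 1, 2, 3
left.merge(right, how='right', on='id')  # ids 2, 3, 4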
87
How do you combine multiple merge() commands in a single line?
df = dataframe1.merge(dataframe2).merge(dataframe3)
88
How do you specify custom keys for merging dataframes in Pandas?
df = df1.merge(df2, on='column_name').merge(df3, left_on='df1_column', right_on='df2_column')
89
What does pd.concat() do, and what are its key parameters?
Combines multiple datasets into one structure. Key Parameters: axis: 0 (vertical) or 1 (horizontal). join: Type of join (default = outer). keys: Labels to identify data sources.
90
What is the syntax for a vertical concatenation of dataframes?
df = pd.concat([df1, df2], axis=0)
91
How do you concatenate dataframes horizontally with an inner join?
df = pd.concat([df1.set_index('col_name'), df2.set_index('col_name')], join='inner', axis=1)
92
How do you merge two dataframes on a common column?
df = dataframe1.merge(dataframe2, how='join_type', on='common_column'). Join types: inner, outer, left, right.
93
How do you merge dataframes with custom join columns?
df = df1.merge(df2, left_on='col_df1', right_on='col_df2', suffixes=['_df1', '_df2'])
94
What happens when matching columns overlap during merging?
Pandas appends _x (from dataframe1) and _y (from dataframe2) to differentiate. Use suffixes to customize labels.
95
What is the difference between merge() and join() in Pandas?
merge(): Combines dataframes based on common columns. join(): Combines dataframes on their indexes.
96
What is the default behavior of pd.concat() when the axis parameter is not specified?
By default, pd.concat() stacks dataframes vertically (axis=0).
97
How do you add labels to identify the source of data in concatenation?
Use the keys parameter: df = pd.concat([df1, df2], keys=['data1', 'data2'], axis=0)
98
What is the syntax for an outer merge?
df = df1.merge(df2, how='outer', on='common_column')
99
How do you combine dataframes using join()?
df = dataframe1.join(dataframe2, how='join_type', lsuffix='_left', rsuffix='_right'). Default join type: left join.
100
How do you merge dataframes when they share no common column?
Use left_on and right_on parameters: df = df1.merge(df2, left_on='col1_df1', right_on='col2_df2')
101
How do you differentiate overlapping columns from merged dataframes?
Use the suffixes parameter: suffixes=['_df1', '_df2']
102
What happens when concatenating dataframes with different indexes or columns?
Outer join (default): Includes all indexes or columns. Inner join: Keeps only matching indexes or columns.
103
How do you align dataframes by their column names for concatenation?
Set indexes with .set_index() before concatenation: df = pd.concat([df1.set_index('col'), df2.set_index('col')], axis=1)
104
What is a key advantage of chaining merge() commands?
Efficiency and clarity when combining multiple datasets in a single line: df = df1.merge(df2).merge(df3)
105
What is the difference between axis=0 and axis=1 in concatenation?
axis=0: Stacks dataframes vertically (rows). axis=1: Stacks dataframes horizontally (columns).
106
107
What does the len() function do in Python when applied to a DataFrame or list?
It returns the total number of rows (for a DataFrame) or items (for a list).
108
How do you access a column’s values in a DataFrame by its name?
Use count['column_name'].
109
How can you isolate specific columns in a DataFrame?
Use double square brackets, e.g., df[['column1', 'column2']]. The columns will appear in the order specified.
110
How do you create a subset of rows from rows 1 to 4 using indexing?
Use count[1:5] (row 5 is excluded).
111
How would you display the first 7 rows of a DataFrame?
Use count[:7].
112
How would you display only the last row of a DataFrame?
Use count[-1:].
113
How do you access a specific value in a DataFrame using iloc?
Use integer positions, e.g., df.iloc[2, 3] for the value at row position 2, column position 3 (zero-based).
114
How do you select all rows but only the last column using iloc?
Use df.iloc[:, -1].
115
How does loc differ from iloc?
loc uses labels (row/column names), while iloc uses integer positions.
116
How do you retrieve a specific value at row 2 and column 'field' using loc?
Use df.loc[2, 'field'].
117
How can you filter rows based on a column value being greater than 10?
Use df[df['column_name'] > 10].
118
How do you extract rows with specific values in a column using .isin()?
Use df[df['column_name'].isin(['value1', 'value2'])]
119
How do you filter rows based on a string condition using .query()?
Use syntax like df.query('Column == "Value"').
120
What is the purpose of the groupby function in Pandas?
It splits a DataFrame into groups based on column(s) and performs operations on each group (e.g., grouping rows by soil types or drainage levels).
121
What are some common aggregation functions used with groupby?
mean: Calculates the average for each group. max: Finds the maximum value for each group. min: Finds the minimum value for each group. sum: Adds up values for each group.
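A minimal sketch of grouping with these aggregations (toy data):

import pandas as pd

df = pd.DataFrame({'soil': ['clay', 'clay', 'sand', 'sand'],
                   'yield': [3.1, 2.9, 2.2, 2.5]})

df.groupby('soil')['yield'].mean()                      # mean per group
df.groupby('soil')['yield'].agg(['min', 'max', 'sum'])  # several aggregations at once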
122
How do you avoid warnings when using groupby on columns with mixed data types?
Use numeric_only=True to limit operations to numeric columns.
123
How can you calculate the mean of grouped data using groupby?
Syntax: df.groupby(['set']).mean(). Example: groups by the "set" column and calculates the mean, producing a table with rows for "control" and "experiment".
124
How do you count the occurrences of unique values in a column using groupby?
Use df.groupby('col_name').size() to get a list of unique values and their counts.
125
What is the purpose of splitting columns in a DataFrame?
To split values in a column into multiple new columns based on a delimiter or character.
126
What is the syntax for splitting a column into multiple columns?
df[['new_col1', 'new_col2']] = df['original_col'].str.split('delimiter', n=number_of_splits, expand=True). delimiter: the character where the split occurs. n: the number of splits to perform. expand=True: ensures the output is split into separate columns.
127
How do you split the string "coding" at the letter d?
df[['col1', 'col2']] = df['coding'].str.split('d', n=1, expand=True). Result: col1 = "co", col2 = "ing" (the delimiter is removed by the split).
128
What are two key notes about splitting columns?
Always use expand=True to create multiple columns. Use the n parameter to control how many splits occur if the delimiter appears multiple times.
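A small sketch of the split behaviour on a toy column (names invented for illustration):

import pandas as pd

df = pd.DataFrame({'sample_id': ['clay_2020_a', 'loam_2021_b']})

# n=1: split only at the first '_'; everything after it stays together
df[['soil', 'rest']] = df['sample_id'].str.split('_', n=1, expand=True)
# soil: 'clay', 'loam'   rest: '2020_a', '2021_b'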
129
What is the purpose of increasing sample size in correlation analysis?
Increasing the sample size improves the precision of the correlation estimate, reduces uncertainty in the p-value, and makes the results more reliable.
130
How do you calculate the Pearson correlation coefficient in Python?
from scipy.stats import pearsonr; result = pearsonr(df['col1'], df['col2']); print(f'r = {result.statistic:.2f}')
131
What does a Pearson correlation coefficient r value of 0.46 and a p-value of 0.03 indicate?
The p-value of 0.03 indicates there is a 3% chance that the correlation occurred by random chance, assuming there is no real correlation (null hypothesis). Since the p-value is low (< 0.05), we may reject the null hypothesis and consider the correlation statistically significant.
132
What is the difference between a statistic and a parameter?
Statistic: A numerical property of a sample (e.g., sample correlation coefficient r). Parameter: A numerical property of the population (e.g., population correlation coefficient ρ).
133
What is the purpose of the p-value in hypothesis testing?
The p-value helps assess the significance of the observed correlation. It represents the probability of observing a test statistic as extreme as the observed one, assuming the null hypothesis is true.
134
How do you interpret a p-value in correlation analysis?
If the p-value < 0.05, we reject the null hypothesis (no correlation) and consider the correlation significant. If the p-value > 0.05, the correlation might be due to random chance and we fail to reject the null hypothesis.
135
How can you calculate a 95% confidence interval for a Pearson correlation?
ci = result.confidence_interval(); the bounds are ci.low and ci.high.
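A self-contained sketch with simulated data; result.confidence_interval() is available on the pearsonr result in recent SciPy versions (1.9+):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # loosely correlated simulated data

result = pearsonr(x, y)
ci = result.confidence_interval()   # 95% CI by default
print(f'r = {result.statistic:.2f}, p = {result.pvalue:.3f}, '
      f'95% CI = ({ci.low:.2f}, {ci.high:.2f})')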
136
What is the purpose of the Shapiro-Wilk test?
The Shapiro-Wilk test assesses if the data follows a normal distribution. The null hypothesis is that the data comes from a normally distributed population.
137
How do you apply the Shapiro-Wilk test in Python?
from scipy.stats import shapiro; result = shapiro(df['col_name'])
138
How do you handle non-normally distributed data in Python?
To transform non-normally distributed data, apply a log transformation: df['log-variable'] = np.log10(df['col_name'])
139
What is the purpose of a t-test for independent samples?
A t-test for independent samples tests whether two independent samples have different means
140
How do you perform a t-test for independent samples in Python?
from scipy.stats import ttest_ind; result = ttest_ind(subset['col1'], subset['col2'])
141
How do you modify the y-axis scale for large values in a scatter plot?
ax.set_yscale('log')
142
What is the probability of observing a correlation of at least 0.46, assuming no real correlation exists?
The p-value gives the probability of observing a correlation of at least 0.46 by chance. If the p-value is small (e.g., < 0.05), it suggests that the observed correlation is statistically significant.
143
What is the purpose of a swarm plot?
Swarm plots are used to compare a numerical variable with two categories, allowing us to visually assess the difference in means between those categories.
144
What is the best use of a scatterplot (lmplot)?
Scatterplots (lmplot) are ideal for visualizing the relationship between two numerical variables: lmplot combines a scatter plot with a regression line to show the relationship.
145
What is an explanatory variable in regression?
The explanatory variable is the independent variable (cause). It is expected to explain the variation in the response variable.
146
What is a response variable in regression?
The response variable is the dependent variable (effect). It is expected to change in response to the explanatory variable.
147
What is the model formula for regression?
The relationship between the response and explanatory variables is expressed as: response_variable ~ explanatory_variable
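The response ~ explanatory notation matches the statsmodels formula API; a sketch under that assumption, with hypothetical data and column names:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: does rainfall explain plant height?
df = pd.DataFrame({'rainfall': [30, 45, 52, 61, 70, 82],
                   'height':   [2.1, 2.6, 2.8, 3.2, 3.3, 3.9]})

model = smf.ols('height ~ rainfall', data=df).fit()
print(model.summary())   # coefficients, t-values, p-values, 95% CIs
print(model.params)      # intercept and slope: mean of y = intercept + slope * x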
148
What does the t-value represent in regression?
The t-value measures how many standard errors the coefficient is away from zero. Example: t-value = 0.79.
149
What does the p-value indicate in regression?
The p-value is the probability of observing a coefficient estimate this extreme if the true coefficient were zero. Example: p-value = 0.433. Since p > 0.05, we fail to reject the null hypothesis (no significant effect).
150
What is a 95% Confidence Interval (CI) in regression?
The 95% CI is a range within which the true population parameter is likely to fall. Example: CI = (-0.25, 0.109). Since the CI contains 0, we cannot reject the null hypothesis.
151
How do you interpret the relationship between the y-axis and x-axis in linear models?
The mean of the y-axis is given by the equation: Mean of y = intercept + slope * x
152
What must be mentioned in the conclusion when interpreting relationships?
It is important to mention whether the relationship is linear or logarithmic and state if there is a statistically significant relationship.