Data analysis code Flashcards

(152 cards)

1
Q

What does axis=1 do in a DataFrame operation?

A

It applies the operation horizontally, across the columns of each row (e.g., df.sum(axis=1) returns one sum per row).

2
Q

What does axis=0 do in a DataFrame operation?

A

It applies the operation vertically, down each column (e.g., df.sum(axis=0) returns one sum per column).
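A minimal sketch of the difference, on a toy DataFrame with made-up values:

import pandas as pd

# Toy DataFrame purely for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

print(df.sum(axis=0))  # down each column:  a -> 3,  b -> 30
print(df.sum(axis=1))  # across each row:   0 -> 11, 1 -> 22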

3
Q

What is the purpose of the Shapiro-Wilk test?

A

It tests whether the data is normally distributed.

4
Q

What is the interpretation of a p-value greater than 0.05 in the Shapiro-Wilk test?

A

We fail to reject the null hypothesis; the data are consistent with a normal distribution (a high p-value is not proof of normality).
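A minimal sketch with SciPy, using simulated data for illustration:

import numpy as np
from scipy.stats import shapiro

data = np.random.normal(loc=0, scale=1, size=100)  # simulated sample
stat, p = shapiro(data)
print(f'W = {stat:.3f}, p = {p:.3f}')
if p > 0.05:
    print('Fail to reject H0: sample is consistent with a normal distribution')
else:
    print('Reject H0: sample deviates from normality')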

5
Q

What does df.shape return?

A

A tuple with the number of rows and columns: (number of rows, number of columns).

6
Q

What statistics does df.describe() provide?

A

It provides summary statistics: Count, Mean, Standard Deviation, Minimum, Quartiles (25%, 50%, 75%), and Maximum.

7
Q

What does df.mean() do in a DataFrame?

A

It computes the mean of each column.

8
Q

How do you access a specific column in a DataFrame?

A

Use the column name: df['column_name'].

9
Q

What does the code df['column_name'].max() do?

A

It returns the maximum value in the specified column.

10
Q

What assumptions does the paired t-test make?

A

It assumes no major outliers (in the differences), independent observations (pairs), a continuous dependent variable, and approximately normally distributed paired differences.
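A minimal sketch using scipy.stats.ttest_rel; the 'before'/'after' columns and values are hypothetical:

import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical paired measurements on the same subjects
df = pd.DataFrame({'before': [4.1, 3.9, 5.0, 4.4, 4.8],
                   'after':  [4.6, 4.2, 5.3, 4.9, 5.1]})

result = ttest_rel(df['before'], df['after'])
print(f't = {result.statistic:.2f}, p = {result.pvalue:.3f}')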

11
Q

How do you read a CSV file into a DataFrame in Python?

A

df = pd.read_csv('../folder/name.filetype')

12
Q

How do you read an Excel file into a DataFrame in Python?

A

df = pd.read_excel('file_path')

13
Q

How do you read a tab-separated file into a DataFrame?

A

Use the sep='\t' parameter in pd.read_csv().

14
Q

How do you handle missing data when reading a file?

A

Use the na_values parameter to specify which strings (e.g., '' or 'NA') should be read in as NaN.

15
Q

How do you specify the data type for integer columns in a DataFrame?

A

Use dtype=pd.Int64Dtype() (the nullable integer type) so the column is read as integers rather than floats, even when it contains missing values.

16
Q

How do you rename columns when reading a file into a DataFrame?

A

Use header=None, names=['column1', 'column2', ...] when reading the file.

17
Q

How do you skip rows from the top when reading a file?

A

Use the skiprows=… parameter.

18
Q

How do you skip rows from the bottom when reading a file?

A

Use the skipfooter=… parameter.

19
Q

How do you set a specific column as the index when reading a file?

A

Use index_col=1 to set the second column as the index
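A sketch combining several of the pd.read_csv() options from the cards above; the path and column names are placeholders:

import pandas as pd

df = pd.read_csv(
    '../data/measurements.csv',          # hypothetical path
    sep=',',                             # sep='\t' for tab-separated files
    na_values=['', 'NA'],                # strings to read as NaN
    header=None,
    names=['site', 'year', 'rainfall'],  # rename columns on read
    skiprows=2,                          # skip metadata rows at the top
    index_col=0,                         # first column becomes the index
    dtype={'year': pd.Int64Dtype()},     # nullable integer column
)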

20
Q

How do you update a specific value in a DataFrame?

A

Use df.at['row_name', 'column_name'] = new_value.

21
Q

What does df.info() display?

A

It displays the number of entries, the index range, the column names, the non-null count per column, and each column's data type.

22
Q

What does the data type float64 represent?

A

It represents decimal numbers.

23
Q

What does the data type int64 represent?

A

It represents whole numbers (integers).

24
Q

What does the data type object represent?

A

It typically represents strings (text); pandas also uses it for columns with mixed types.

25
How do you display the last 5 rows of a DataFrame?
Use df.tail().
26
How do you sort the values in a DataFrame by a specific column?
Use df.sort_values(by=['column_name'], ascending=False).
27
How do you drop rows with missing values from a DataFrame?
Use df.dropna(inplace=True).
28
What is the effect of using inplace=True in df.dropna()?
It modifies the original DataFrame.
29
What is the effect of using inplace=False in df.dropna()?
It creates a new DataFrame and leaves the original unchanged.
30
How do you drop columns with missing values from a DataFrame?
Use df.dropna(axis='columns', inplace=True)
31
How do you drop rows that have less than 2 non-missing values?
Use df.dropna(thresh=2, inplace=True).
32
How do you drop rows based on missing values in specific columns?
Use df.dropna(subset=['column1', 'column2'], inplace=True).
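A small sketch of the dropna() variants from cards 27-32, on a toy DataFrame with made-up values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': ['x', 'y', np.nan]})

df.dropna()                   # drop rows containing any NaN
df.dropna(axis='columns')     # drop columns containing any NaN
df.dropna(thresh=2)           # keep rows with at least 2 non-missing values
df.dropna(subset=['a', 'c'])  # only consider columns 'a' and 'c'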
33
How do you save a DataFrame as a CSV file?
Use df.to_csv('data_name.filetype', index=False).
34
How do you save a DataFrame as an Excel file?
Use df.to_excel('data_name.filetype', index=False).
35
How do you exclude the index when saving a DataFrame to a file?
Use index=False in the to_csv() or to_excel() methods.
36
How do you add a new column to a DataFrame?
df['Column name'] = ['column contents', 'next value', ...]
37
How do you forward-fill NaN values in a column?
df['Column name'] = df['Column name'].fillna(method="ffill") (or, in newer pandas, df['Column name'].ffill())
38
How do you remove a single column from a DataFrame?
df = df.drop(columns='column name')
39
How do you remove multiple columns from a DataFrame?
df = df.drop(columns=['name 1', 'name 2'])
40
How do you remove a specific row by its index?
df = df.drop(row_label), where row_label is the row's index label.
41
How do you rearrange columns into a custom order?
cols = df.columns.tolist(); cols_new = [cols[1], cols[3], cols[2], cols[0]]; df = df[cols_new]
42
How do you sort the columns alphabetically?
cols_new = sorted(cols) then df = df[cols_new]
43
How do you sort the columns in reverse alphabetical order?
cols_new = sorted(cols, reverse=True) then df = df[cols_new]
44
How do you sort values by multiple columns in descending order?
df = df.sort_values(by=[col1, col2], ascending=False)
45
How do you create a new column by multiplying an existing column by a scalar?
df['new column'] = df['old column'] * 50
46
How do you apply a logarithmic transformation to a column?
df['new column'] = np.log(df['old column'])
47
How do you apply a square root transformation to a column?
df['new column'] = np.sqrt(df['old column'])
48
How do you concatenate two string columns in a DataFrame?
df['Soil_Drainage'] = df['Soil'] + '_' + df['Drainage']
49
How do you convert numerical data to strings before concatenating?
Use .astype(str) to convert the numeric column to strings, e.g. df['new_column'] = df['text col'] + '_' + df['numeric col'].astype(str)
50
How do you split a column into multiple columns based on a delimiter?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', expand=True)
51
How do you limit the number of splits to just once when splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_', n=1, expand=True)
52
How do you use a specific substring for splitting a column?
df[['new col1', 'new col2']] = df['old_column'].str.split('_20', expand=True)
53
How do you remove duplicate rows from a DataFrame?
data = data.drop_duplicates()
54
How do you reset the index after dropping duplicates from a DataFrame?
data = data.drop_duplicates().reset_index(drop=True)
55
How do you transpose a DataFrame to swap rows and columns?
df = df.T (This makes the index the column headers and the column headers the index).
56
How do you reset the index and move it to a column in a DataFrame?
Use df.reset_index(inplace=True); the old index becomes a regular column (named 'index' if it was unnamed). If you do not want to keep it, remove it with df = df.drop(columns='index').
57
How do you select a range of rows using iloc in a DataFrame?
df.iloc[1:2] selects rows from position 1 up to but not including position 2.
58
How do you check the data types of each column in a DataFrame?
Use df.dtypes to check the data types of each column.
59
How do you change the data type of a column in a DataFrame?
Use df['column_name'] = df['column_name'].astype(new_type) to cast a column to a different type.
60
What is the purpose of melting a DataFrame, and how do you do it?
Melting transforms data from wide form into long form, which is useful for certain types of analysis and visualization. Use pd.melt(dataframe, id_vars, value_vars, var_name='new_var_name', value_name='new_value_name').
61
What parameters do you need to specify when melting a DataFrame?
id_vars: Columns to keep as is (typically identifiers). value_vars: Columns containing the values to melt. var_name: The name for the new variable column. value_name: The name for the new value column.
62
What is the benefit of melting a DataFrame?
Melting makes the dataset more suitable for graphing and further analysis, especially for visualizing or performing operations on variables in a consistent, long format.
63
What is casting (pivoting) in DataFrame manipulation, and how do you do it?
Pivoting (casting) reverses melting, converting a long-form DataFrame into a wide-form DataFrame by spreading values into new columns. Use pd.pivot(dataframe, columns='column header', index=['index header', ...], values='data column').
64
What parameters do you need to specify when pivoting a DataFrame?
columns: The column whose unique values become the new column headers. index: The column(s) that become the index of the DataFrame. values: The column containing the data to populate the new table.
65
How do you reset the index after pivoting a DataFrame?
Use df_cast.reset_index(inplace=True) to convert any multi-level index back into regular columns.
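A round-trip sketch of melting and then pivoting; the column names and values are invented for illustration:

import pandas as pd

wide = pd.DataFrame({'plot': ['p1', 'p2'],
                     '2020': [3.1, 2.8],
                     '2021': [3.4, 3.0]})

# Wide -> long
long_df = pd.melt(wide, id_vars=['plot'], value_vars=['2020', '2021'],
                  var_name='year', value_name='yield')

# Long -> wide (reverses the melt)
wide_again = pd.pivot(long_df, columns='year', index='plot', values='yield')
wide_again.reset_index(inplace=True)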
66
How do you create a relational plot to show the relationship between two numerical variables?
Use sns.relplot(x='sepal_length', y='petal_length', data=iris).
67
How do you turn a relational plot into a line plot with a confidence interval?
Add kind='line' to the sns.relplot function. The shaded area represents the 95% confidence interval.
68
How do you differentiate points in a relational plot by a categorical variable using color?
Add hue='column_name' to the plot, where 'column_name' is the categorical variable
69
What does sns.lmplot do?
It creates a scatter plot with a linear regression line and a shaded 95% confidence interval.
70
How do you create a composite plot with both categorical and numerical variables?
Use sns.jointplot(x='sepal_length', y='petal_length', data=iris) or sns.pairplot(data=iris).
71
How do you create a heatmap in Seaborn?
Use sns.heatmap(data, annot=True, fmt=".2f", cmap="viridis", cbar=True).
72
How do you create multiple plots side by side?
Use fig, axes = plt.subplots(1, 2, figsize=(10, 5)) for two plots in one row
73
How do you create a distribution plot (histogram) for numerical variables?
Use sns.displot(x='sepal_length', data=iris). Add hue or col='column_name' for splitting based on a categorical variable.
74
How do you explore relationships between categorical variables using a categorical plot?
Use sns.catplot(x='Species', y='petal_length', data=iris). Add kind='swarm' or kind='strip' for different plot styles.
75
How do you customize violin plots to display both categories of a variable on the same plot?
Add split=True to sns.violinplot (together with a two-level hue variable) to show both categories on the same violin.
76
How do you set different Seaborn plot styles?
Use sns.set_style('style'). Options: whitegrid (white background with grid), darkgrid (grey background with grid), dark (grey background without grid), white (white background without grid).
77
How do you suppress the top text on a Seaborn plot?
Add a semicolon ; at the end of the Seaborn plotting code to suppress the text output (the returned object's description) that otherwise appears above the plot.
78
How do you set the context for a Seaborn plot?
Use sns.set_context('context'). Options: paper (smaller text, for publication), notebook (default for notebooks), talk (larger, for presentations), poster (largest, for posters).
79
How do you view and set color palettes in Seaborn?
To view a palette: sns.color_palette('palette name'). To set it: sns.set_palette('palette name') (e.g., sns.set_palette('tab10') for the default color set).
80
How do you reset Seaborn’s default settings?
Use sns.reset_defaults() to reset color and figure size to Seaborn’s default settings
81
How do you control the size of figure-level plots (e.g., relplot, displot)?
Use sns.relplot(x='', y='', data=..., hue='', height=6, aspect=1.5). height: plot height in inches; aspect: aspect ratio (width/height).
82
How do you control the size of axes-level plots (e.g., scatterplot, boxplot)?
Use plt.figure(figsize=(width, height)). Example: plt.figure(figsize=(9, 6)) for 9x6 inches.
83
How do you customize titles and axis labels for multiple plots in Seaborn?
Title: g.fig.suptitle('Title', fontsize=..., y=...). Axis labels: g.set_axis_labels('x label', 'y label', fontsize=...). Y-axis limits: g.set(ylim=(0, 8)).
84
How do you customize individual plots and legends in Seaborn and Matplotlib?
Title: plt.title('Title', fontsize=...). Axis labels: plt.xlabel('x label', fontsize=...) and plt.ylabel('y label', fontsize=...). Customize the legend: plt.legend(loc='', title='', frameon=False). Remove the legend: pass legend=False to the Seaborn call (where supported) or use ax.get_legend().remove().
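A sketch pulling these customizations together on an axes-level plot, using Seaborn's built-in iris dataset:

import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset('iris')

plt.figure(figsize=(9, 6))
ax = sns.scatterplot(x='sepal_length', y='petal_length', hue='species', data=iris)
plt.title('Petal vs. sepal length', fontsize=14)
plt.xlabel('Sepal length (cm)', fontsize=12)
plt.ylabel('Petal length (cm)', fontsize=12)
plt.legend(loc='upper left', title='Species', frameon=False)
plt.savefig('iris_scatter.png')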
85
How do you save a plot as an image?
Use plt.savefig('filename.png') to save the plot as an image (you can specify other formats like .jpg, .svg, etc.).
86
What are the four types of joins in Pandas?
Outer Join: Combines all rows from both dataframes, including non-overlapping rows. Inner Join: Includes only rows common to both dataframes. Left Join: Includes all rows from the left dataframe and matching rows from the right. Right Join: Includes all rows from the right dataframe and matching rows from the left.
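A toy example of the four join types (made-up data):

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'soil': ['clay', 'loam', 'sand']})
right = pd.DataFrame({'id': [2, 3, 4], 'yield': [3.1, 2.8, 3.5]})

left.merge(right, how='inner', on='id')  # ids 2, 3 only
left.merge(right, how='outer', on='id')  # ids 1-4, NaN where unmatched
left.merge(right, how='left', on='id')   # ids 1, 2, 3
left.merge(right, how='right', on='id')  # ids 2, 3, 4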
87
How do you combine multiple merge() commands in a single line?
df = dataframe1.merge(dataframe2).merge(dataframe3)
88
How do you specify custom keys for merging dataframes in Pandas?
df = df1.merge(df2, on='column_name').merge(df3, left_on='df1_column', right_on='df2_column')
89
What does pd.concat() do, and what are its key parameters?
Combines multiple datasets into one structure. Key Parameters: axis: 0 (vertical) or 1 (horizontal). join: Type of join (default = outer). keys: Labels to identify data sources.
90
What is the syntax for a vertical concatenation of dataframes?
df = pd.concat([df1, df2], axis=0)
91
How do you concatenate dataframes horizontally with an inner join?
df = pd.concat([df1.set_index('col_name'), df2.set_index('col_name')], join='inner', axis=1)
92
How do you merge two dataframes on a common column?
df = dataframe1.merge(dataframe2, how='join_type', on='common_column'). Join types: inner, outer, left, right.
93
How do you merge dataframes with custom join columns?
df = df1.merge(df2, left_on='col_df1', right_on='col_df2', suffixes=['_df1', '_df2'])
94
What happens when matching columns overlap during merging?
Pandas appends _x (from dataframe1) and _y (from dataframe2) to differentiate. Use suffixes to customize labels.
95
What is the difference between merge() and join() in Pandas?
merge(): Combines dataframes based on common columns. join(): Combines dataframes on their indexes.
96
What is the default behavior of pd.concat() when the axis parameter is not specified?
By default, pd.concat() stacks dataframes vertically (axis=0).
97
How do you add labels to identify the source of data in concatenation?
Use the keys parameter: df = pd.concat([df1, df2], keys=['data1', 'data2'], axis=0)
98
What is the syntax for an outer merge?
df = df1.merge(df2, how='outer', on='common_column')
99
How do you combine dataframes using join()?
df = dataframe1.join(dataframe2, how='join_type', lsuffix='_left', rsuffix='_right'). Default join type: left join.
100
How do you merge dataframes when they share no common column?
Use left_on and right_on parameters: df = df1.merge(df2, left_on='col1_df1', right_on='col2_df2')
101
How do you differentiate overlapping columns from merged dataframes?
Use the suffixes parameter: suffixes=['_df1', '_df2']
102
What happens when concatenating dataframes with different indexes or columns?
Outer join (default): Includes all indexes or columns. Inner join: Keeps only matching indexes or columns.
103
How do you align dataframes by their column names for concatenation?
Set indexes with .set_index() before concatenation: df = pd.concat([df1.set_index('col'), df2.set_index('col')], axis=1)
104
What is a key advantage of chaining merge() commands?
Efficiency and clarity when combining multiple datasets in a single line: df = df1.merge(df2).merge(df3)
105
What is the difference between axis=0 and axis=1 in concatenation?
axis=0: Stacks dataframes vertically (rows). axis=1: Stacks dataframes horizontally (columns).
106
107
What does the len() function do in Python when applied to a DataFrame or list?
It returns the total number of rows (for a DataFrame) or items (for a list).
108
How do you access a column’s values in a DataFrame by its name?
Use count['column_name'].
109
How can you isolate specific columns in a DataFrame?
Use double square brackets, e.g., df[['column1', 'column2']]. The columns will appear in the order specified.
110
How do you create a subset of rows from rows 1 to 4 using indexing?
Use count[1:5] (row 5 is excluded).
111
How would you display the first 7 rows of a DataFrame?
Use count[:7].
112
How would you display only the last row of a DataFrame?
Use count[-1:].
113
How do you access a specific value in a DataFrame using iloc?
Use integer positions, e.g., df.iloc[2, 3] for the value at row position 2, column position 3 (zero-based).
114
How do you select all rows but only the last column using iloc?
Use df.iloc[:, -1].
115
How does loc differ from iloc?
loc uses labels (row/column names), while iloc uses integer positions.
116
How do you retrieve a specific value at row 2 and column 'field' using loc?
Use df.loc[2, 'field'].
117
How can you filter rows based on a column value being greater than 10?
Use df[df['column_name'] > 10].
118
How do you extract rows with specific values in a column using .isin()?
Use df[df['column_name'].isin(['value1', 'value2'])]
119
How do you filter rows based on a string condition using .query()?
Use syntax like df.query('Column == "Value"').
120
What is the purpose of the groupby function in Pandas?
It splits a DataFrame into groups based on column(s) and performs operations on each group (e.g., grouping rows by soil types or drainage levels).
121
What are some common aggregation functions used with groupby?
mean: Calculates the average for each group. max: Finds the maximum value for each group. min: Finds the minimum value for each group. sum: Adds up values for each group.
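A minimal sketch of grouping with these aggregations (toy data):

import pandas as pd

df = pd.DataFrame({'soil': ['clay', 'clay', 'sand', 'sand'],
                   'yield': [3.1, 2.9, 2.2, 2.5]})

df.groupby('soil')['yield'].mean()                      # mean per group
df.groupby('soil')['yield'].agg(['min', 'max', 'sum'])  # several aggregations at once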
122
How do you avoid warnings when using groupby on columns with mixed data types?
Use numeric_only=True to limit operations to numeric columns.
123
How can you calculate the mean of grouped data using groupby?
Syntax: df.groupby(['set']).mean(). Example: groups by the "set" column and calculates the mean, producing a table with rows for "control" and "experiment".
124
How do you count the occurrences of unique values in a column using groupby?
Use df.groupby('col_name').size() to get a list of unique values and their counts.
125
What is the purpose of splitting columns in a DataFrame?
To split values in a column into multiple new columns based on a delimiter or character.
126
What is the syntax for splitting a column into multiple columns?
df[['new_col1', 'new_col2']] = df['original_col'].str.split('delimiter', n=number_of_splits, expand=True). delimiter: the character where the split occurs. n: the number of splits to perform. expand=True: ensures the output is split into separate columns.
127
How do you split the string "coding" at the letter d?
df[['col1', 'col2']] = df['coding'].str.split('d', n=1, expand=True). Result: col1 = "co", col2 = "ing" (the delimiter is removed by the split).
128
What are two key notes about splitting columns?
Always use expand=True to create multiple columns. Use the n parameter to control how many splits occur if the delimiter appears multiple times.
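A small sketch of the split behaviour on a toy column (names invented for illustration):

import pandas as pd

df = pd.DataFrame({'sample_id': ['clay_2020_a', 'loam_2021_b']})

# n=1: split only at the first '_'; everything after it stays together
df[['soil', 'rest']] = df['sample_id'].str.split('_', n=1, expand=True)
# soil: 'clay', 'loam'   rest: '2020_a', '2021_b'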
129
What is the purpose of increasing sample size in correlation analysis?
Increasing the sample size improves the precision of the correlation estimate, reduces uncertainty in the p-value, and makes the results more reliable.
130
How do you calculate the Pearson correlation coefficient in Python?
from scipy.stats import pearsonr; result = pearsonr(df['col1'], df['col2']); print(f'r = {result.statistic:.2f}')
131
What does a Pearson correlation coefficient r value of 0.46 and a p-value of 0.03 indicate?
The p-value of 0.03 indicates there is a 3% chance that the correlation occurred by random chance, assuming there is no real correlation (null hypothesis). Since the p-value is low (< 0.05), we may reject the null hypothesis and consider the correlation statistically significant.
132
What is the difference between a statistic and a parameter?
Statistic: A numerical property of a sample (e.g., sample correlation coefficient r). Parameter: A numerical property of the population (e.g., population correlation coefficient ρ).
133
What is the purpose of the p-value in hypothesis testing?
The p-value helps assess the significance of the observed correlation. It represents the probability of observing a test statistic as extreme as the observed one, assuming the null hypothesis is true.
134
How do you interpret a p-value in correlation analysis?
If the p-value < 0.05, we reject the null hypothesis (no correlation) and consider the correlation significant. If the p-value > 0.05, the correlation might be due to random chance and we fail to reject the null hypothesis.
135
How can you calculate a 95% confidence interval for a Pearson correlation?
ci = result.confidence_interval(); the bounds are ci.low and ci.high.
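A self-contained sketch with simulated data; result.confidence_interval() is available on the pearsonr result in recent SciPy versions (1.9+):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)   # loosely correlated simulated data

result = pearsonr(x, y)
ci = result.confidence_interval()   # 95% CI by default
print(f'r = {result.statistic:.2f}, p = {result.pvalue:.3f}, '
      f'95% CI = ({ci.low:.2f}, {ci.high:.2f})')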
136
What is the purpose of the Shapiro-Wilk test?
The Shapiro-Wilk test assesses if the data follows a normal distribution. The null hypothesis is that the data comes from a normally distributed population.
137
How do you apply the Shapiro-Wilk test in Python?
from scipy.stats import shapiro; result = shapiro(df['col_name'])
138
How do you handle non-normally distributed data in Python?
To transform non-normally distributed data, apply a log transformation: df['log-variable'] = np.log10(df['col_name'])
139
What is the purpose of a t-test for independent samples?
A t-test for independent samples tests whether two independent samples have different means
140
How do you perform a t-test for independent samples in Python?
from scipy.stats import ttest_ind; result = ttest_ind(subset['col1'], subset['col2'])
141
How do you modify the y-axis scale for large values in a scatter plot?
ax.set_yscale('log')
142
What is the probability of observing a correlation of at least 0.46, assuming no real correlation exists?
The p-value gives the probability of observing a correlation of at least 0.46 by chance. If the p-value is small (e.g., < 0.05), it suggests that the observed correlation is statistically significant.
143
What is the purpose of a swarm plot?
Swarm plots are used to compare a numerical variable with two categories, allowing us to visually assess the difference in means between those categories.
144
What is the best use of a scatterplot (lmplot)?
Scatterplots (lmplot) are ideal for visualizing the relationship between two numerical variables: lmplot combines a scatter plot with a regression line to show the relationship.
145
What is an explanatory variable in regression?
The explanatory variable is the independent variable (cause). It is expected to explain the variation in the response variable.
146
What is a response variable in regression?
The response variable is the dependent variable (effect). It is expected to change in response to the explanatory variable.
147
What is the model formula for regression?
The relationship between the response and explanatory variables is expressed as: response_variable ~ explanatory_variable
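The response ~ explanatory notation matches the statsmodels formula API; a sketch under that assumption, with hypothetical data and column names:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: does rainfall explain plant height?
df = pd.DataFrame({'rainfall': [30, 45, 52, 61, 70, 82],
                   'height':   [2.1, 2.6, 2.8, 3.2, 3.3, 3.9]})

model = smf.ols('height ~ rainfall', data=df).fit()
print(model.summary())   # coefficients, t-values, p-values, 95% CIs
print(model.params)      # intercept and slope: mean of y = intercept + slope * x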
148
What does the t-value represent in regression?
The t-value measures how many standard errors the coefficient is away from zero. Example: t-value = 0.79.
149
What does the p-value indicate in regression?
The p-value is the probability of observing a coefficient estimate this extreme if the true coefficient were zero. Example: p-value = 0.433. Since p > 0.05, we fail to reject the null hypothesis (no significant effect).
150
What is a 95% Confidence Interval (CI) in regression?
The 95% CI is a range within which the true population parameter is likely to fall. Example: CI = (-0.25, 0.109). Since the CI contains 0, we cannot reject the null hypothesis.
151
How do you interpret the relationship between the y-axis and x-axis in linear models?
The mean of the y-axis is given by the equation: Mean of y = intercept + slope * x
152
What must be mentioned in the conclusion when interpreting relationships?
It is important to mention whether the relationship is linear or logarithmic and state if there is a statistically significant relationship.