Midterm Flashcards

Question

how to subset a dataframe to match 2 conditions

Answer 1

h[cond1 & cond2] keeps rows that match both conditions; h[cond1 | cond2] keeps rows that match either one
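a minimal sketch of the two masks, assuming a small made-up dataframe (the column names and values are hypothetical, not from the course data):

```python
import pandas as pd

# hypothetical data, for illustration only
h = pd.DataFrame({
    "state": ["virginia", "arizona", "ohio", "virginia"],
    "price": [100, 250, 300, 400],
})

cond1 = h["state"] == "virginia"
cond2 = h["price"] > 200

both = h[cond1 & cond2]    # rows where both conditions hold
either = h[cond1 | cond2]  # rows where at least one condition holds

# when conditions are written inline, each one needs its own parentheses
inline = h[(h["state"] == "virginia") & (h["price"] > 200)]
```

the parentheses matter because & and | bind more tightly than comparison operators such as == and >.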

Answer 2

h1 = h[h['state'].isin(['north', 'virginia', 'arizona'])]

Answer 3

df.plot(x=xcol, y=ycol, kind=kind, title=title) followed by plt.show(); in case of a histogram, df[col].hist()

Answer 4

df.isna().sum() (can be plotted using .plot(kind='bar'))

Answer 5

df.fillna(0)
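a short sketch of counting and then filling missing values, assuming a tiny hypothetical dataframe:

```python
import numpy as np
import pandas as pd

# hypothetical data with a few missing entries
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

missing_per_column = df.isna().sum()  # a: 1, b: 2
filled = df.fillna(0)                 # returns a new dataframe, df is unchanged
```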

Answer 6

the practice and study of collecting and analyzing data to derive a fact or a summary

Answer 7

descriptive statistics: describe and summarize data
inferential statistics: use a sample of data to make inferences about a larger population

Answer 8

numeric (quantitative): continuous (measured), discrete (counted)
categorical (qualitative): nominal (unordered), ordinal (ordered): strongly agree>

Answer 9

mean (average): sum of the values / number of samples
median: the middle value, with 50% of the data above it and 50% below
mode: the most frequent value in the data
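a quick worked example with pandas, using made-up values; note how the single large value pulls the mean but not the median:

```python
import pandas as pd

# hypothetical values, for illustration only
values = pd.Series([1, 2, 2, 3, 12])

mean = values.mean()      # 20 / 5 = 4.0
median = values.median()  # middle of the sorted values = 2.0
mode = values.mode()[0]   # most frequent value = 2
```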

Answer 10

a left skewed histogram has its long tail on the left, so most entries are concentrated at high values and a few low values typically pull the mean below the median (right skewed is the opposite)

Answer 11

df.groupby('country')['consumption'].agg([np.mean, np.median])

Answer 12

np.sqrt(np.var(df['col'], ddof=1)) or np.std(df['col'], ddof=1)
note that ddof=1 is for data that is a sample (divides by n - 1), while ddof=0 (the numpy default) is for data that covers the whole population (divides by n)
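a small numeric check of how ddof changes the divisor, assuming a made-up array of eight values:

```python
import numpy as np

# hypothetical data: mean is 5, squared deviations sum to 32
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(data, ddof=0)  # sqrt(32 / 8) = 2.0, divides by n
sample_std = np.std(data, ddof=1)      # sqrt(32 / 7), divides by n - 1

# the standard deviation is the square root of the variance
same_as_sample_std = np.sqrt(np.var(data, ddof=1))
```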

Answer 13

split the data into some number of equal parts; np.quantile(df[col], 0.5) gives the 0.5 quantile, which is the median

Answer 14

interquartile range (IQR): another measure of spread, the distance between the 25th and 75th percentiles

Answer 15

a data point that is substantially different from the others; a data point is an outlier if:
data < Q1 - 1.5 x IQR or data > Q3 + 1.5 x IQR
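the outlier rule can be sketched with np.quantile; the data below is made up, and numpy's default linear interpolation is assumed for the quartiles:

```python
import numpy as np

# hypothetical data with one obviously extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.quantile(data, [0.25, 0.75])  # 3.25 and 7.75
iqr = q3 - q1                             # 4.5

lower = q1 - 1.5 * iqr                    # -3.5
upper = q3 + 1.5 * iqr                    # 14.5

outliers = data[(data < lower) | (data > upper)]  # only 100 is flagged
```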

Answer 16

np.allclose(A, B)

Answer 17

df.sample(n) samples n rows without replacement (replace=False is the default, so a row cannot be picked twice); df.sample(n, replace=True) samples with replacement

Answer 18

sets the starting number (the seed) of the pseudorandom number generator so that we get the same random numbers on every run of the code
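a small sketch of seeding before sampling, assuming np.random.seed is the seed function in question and using a hypothetical ten-row dataframe:

```python
import numpy as np
import pandas as pd

# hypothetical dataframe with ten rows
df = pd.DataFrame({"id": range(10)})

np.random.seed(42)
first = df.sample(3)   # without replacement by default

np.random.seed(42)
second = df.sample(3)  # same seed, so the same rows come back

# with replacement, more rows than the dataframe holds can be drawn
with_replacement = df.sample(20, replace=True, random_state=0)
```

df.sample also accepts random_state directly, which avoids touching numpy's global state.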

Answer 19

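uniform.rvs(loc=start, scale=end - start, size=arraySize) (scipy takes the lower bound and the width of the interval, not the upper bound)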

Answer 20

uniform.cdf(x, loc=start, scale=end - start) returns the probability of getting a value <= x
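a sketch of scipy's uniform distribution on a hypothetical interval from 2 to 10; scipy parameterizes it with loc (lower bound) and scale (width), so the interval is [loc, loc + scale]:

```python
from scipy.stats import uniform

# hypothetical interval, for illustration only
start, end = 2, 10
width = end - start

# 1000 random draws between start and end
samples = uniform.rvs(loc=start, scale=width, size=1000, random_state=0)

# P(X <= 4) = (4 - 2) / 8 = 0.25
p_at_most_4 = uniform.cdf(4, loc=start, scale=width)

# P(4 < X <= 8) = 0.75 - 0.25 = 0.5
p_between_4_and_8 = uniform.cdf(8, loc=start, scale=width) - p_at_most_4
```

when start is 0 the width equals the upper bound, which is why passing the upper bound directly appears to work in that special case.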

Answer 21

when the trials are not independent, the binomial distribution does not apply

Answer 22

norm.ppf(percent, loc=mean, scale=std) (the third argument is the standard deviation, not the variance)
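a sketch showing that norm.ppf is the inverse of norm.cdf, assuming a hypothetical normal distribution with mean 100 and standard deviation 15:

```python
from scipy.stats import norm

# hypothetical distribution parameters
mean, std = 100, 15

# the value below which 90% of the distribution lies
x90 = norm.ppf(0.9, loc=mean, scale=std)

# feeding that value back into the cdf recovers the probability
p_back = norm.cdf(x90, loc=mean, scale=std)
```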

Answer 23

pmf: probability of exactly x; cdf: probability of x or less
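a worked example with scipy's binomial distribution, assuming 10 independent flips of a fair coin (the scenario is made up for illustration):

```python
from scipy.stats import binom

n, p = 10, 0.5  # hypothetical: 10 fair coin flips

# probability of exactly 7 heads: C(10, 7) / 2**10 = 120 / 1024
p_exactly_7 = binom.pmf(7, n, p)

# probability of 7 heads or fewer: 1 - (45 + 10 + 1) / 1024 = 968 / 1024
p_at_most_7 = binom.cdf(7, n, p)
```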

Answer 24

a number in [-1, 1] that quantifies the linear relationship between x and y; the closer it is to 0, the weaker the relationship
df[x].corr(df[y])
could be used as a first check for linear regression
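a tiny example of the two extremes, using made-up columns where y rises perfectly with x and z falls perfectly with x:

```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # perfectly increasing with x
    "z": [5, 4, 3, 2, 1],   # perfectly decreasing with x
})

r_xy = df["x"].corr(df["y"])  # close to 1
r_xz = df["x"].corr(df["z"])  # close to -1
```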

Answer 25

import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(x=xcol, y=ycol, data=df, ci=None)  # ci=None hides the confidence band
plt.show()

Answer 26

tables that are derived from original tables
df.pivot_table(values=, index=, aggfunc=[np.mean, np.median])
aggfunc is optional; when omitted, the mean is used
can also add columns=boolcol to compute the summary of values separately for each value of that bool col
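a short sketch of both forms, assuming a hypothetical sales table (the column names store, is_holiday and weekly_sales are made up):

```python
import pandas as pd

# hypothetical data, for illustration only
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "is_holiday": [False, True, False, True],
    "weekly_sales": [10.0, 20.0, 30.0, 50.0],
})

# aggfunc omitted, so the mean per store is computed: A -> 15.0, B -> 40.0
by_store = sales.pivot_table(values="weekly_sales", index="store")

# columns= adds one output column per value of is_holiday (False, True)
by_store_and_holiday = sales.pivot_table(
    values="weekly_sales", index="store", columns="is_holiday"
)
```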

Answer 27

no missing values
data in numeric format
data stored in a pandas dataframe or numpy array

Answer 28

predict the label of any data point by looking at the k closest labeled data points and getting them to vote on what label the unlabeled observation should have

Answer 29

high k causes underfitting; low k causes overfitting
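one way to see the effect of k is to compare train and test accuracy for a few values; this sketch assumes scikit-learn's KNeighborsClassifier and a synthetic dataset, neither of which comes from the cards:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic data, for illustration only
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for k in (1, 15, 150):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # store (train accuracy, test accuracy) for each k
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
```

with k=1 every training point is its own nearest neighbor, so train accuracy is perfect; a large gap between train and test accuracy is the usual sign of overfitting.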

Answer 30

when alpha in the ridge is too low, we get overfitting; when it is too high, we get underfitting

Answer 31

it is used to measure feature importance

Answer 32

no, it is not a good measure on uneven classes. we can use a confusion matrix instead
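a sketch of why, assuming the measure in question is accuracy and using made-up labels where 95% of samples belong to one class:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# hypothetical labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# a useless model that always predicts the majority class
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)   # 0.95, looks good
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
# cm is [[95, 0], [5, 0]]: all 5 positives were missed
```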

Answer 33

parameters that we specify before fitting a model, like alpha and n_neighbors

Answer 34

1- Try lots of different hyperparameter values
2- Fit all of them separately
3- See how well they perform
4- Choose the best performing values
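the four steps above can be sketched with scikit-learn's GridSearchCV; the Ridge model, the alpha grid and the synthetic data are assumptions chosen for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# step 1: candidate hyperparameter values (hypothetical grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# steps 2 and 3: fit each candidate with 5-fold cross-validation and score it
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)

# step 4: keep the best performing value
best_alpha = search.best_params_["alpha"]
best_score = search.best_score_  # mean cross-validated R-squared of the best alpha
```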

Answer 35

to avoid overfitting the hyperparameters to the test set

Answer 36

standardization: rescale each feature to mean = 0 and std = 1; we use it when the data follow a gaussian distribution or when the features are normally distributed (linear and logistic regression or neural networks)
normalization (typically min-max scaling to [0, 1]): we use it when we know the data does not follow a gaussian (normal) distribution
both are linear rescalings, so both maintain the shape of the original distribution; only the location and scale change
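a numpy sketch of the two rescalings on made-up values, assuming normalization means min-max scaling to [0, 1]:

```python
import numpy as np

# hypothetical feature values
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

# standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# min-max normalization: map the smallest value to 0 and the largest to 1
normalized = (x - x.min()) / (x.max() - x.min())
```

in scikit-learn the corresponding transformers are StandardScaler and MinMaxScaler.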

Answer 37

L1 (lasso) and L2 (ridge)
LogisticRegression(solver='liblinear', penalty='l1'); L2 is used by default

Answer 38

widely used for classification problems but can be employed in regression problems

Answer 39

svm uses this trick to map non-separable datasets into a higher dimension where they become linearly separable.

Answer 40

size of the dataset: fewer features = simpler model and faster training time
interpretability: easier to explain => important for stakeholders, like linear and logistic regression
flexibility: improve accuracy by making fewer assumptions, like KNN
metrics: RMSE, R-squared, accuracy, precision, recall

Answer 41

how spread out the samples of each k-means cluster are; it measures the quality of a k-means clustering when we don't have prelabeled clusters
model.inertia_
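a sketch of reading the inertia for several cluster counts, assuming scikit-learn's KMeans and three synthetic, well-separated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: three tight blobs centered at 0, 5 and 10
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, (50, 2)),
    rng.normal(5, 0.5, (50, 2)),
    rng.normal(10, 0.5, (50, 2)),
])

inertias = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # sum of squared distances from each sample to its cluster center
    inertias[k] = model.inertia_
```

lower inertia means tighter clusters, but it keeps shrinking as k grows, so a common heuristic is to pick the k where the decrease levels off (here that should be around 3).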

(67 cards)