Midterm Flashcards

1
Q

what can data do (4)

A

 Describe the current state of an organization or process
 Detect anomalous events
 Diagnose the causes of events and behaviors
 Predict future events

2
Q

describe the 4 steps in ds workflow

A

data collection and storage

data preparation

exploration and visualization

experimentation and prediction

3
Q

what are the 3 applications of data science

A

traditional machine learning

internet of things

deep learning

4
Q

what do we need for machine learning

A

a well defined question

a set of example data

a new set of data to use our algorithm on

5
Q

what is deep learning

A

many neurons work together

requires much more training data

used in complex problems: image classifications, language learning/understanding

6
Q

what is supervised machine learning

A

predictions from data with labels and features

7
Q

what is churn prediction

A

trying to predict whether the customer will likely terminate their subscription with a certain service in the future

8
Q

what is clustering and what are 3 use cases

A

divide data into categories

use cases:
customer segmentation
image segmentation
anomaly detection

9
Q

how do you slice a list in python

A

list[start:end]: start is inclusive, end is exclusive, and both are optional
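A minimal sketch of the slicing rule above. The list `fam` and its values are hypothetical, borrowed from the enumerate card later in the deck.

```python
# Hypothetical example list.
fam = [1.73, 1.68, 1.71, 1.89]

first_two = fam[0:2]    # index 0 is included, index 2 is excluded
from_second = fam[1:]   # omitted end: the slice runs to the end of the list
up_to_third = fam[:3]   # omitted start: the slice begins at index 0

print(first_two, from_second, up_to_third)
```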

10
Q

how do you delete an element in a list

A

del list[index]

11
Q

does python work by reference or assignment

A

reference: assigning a list to another name (y = x) makes both names point to the same list object

12
Q

how can you make a copy of a list instead of referencing the original

A

y = x[:]
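A short sketch contrasting a second name for the same list with a slice copy. The variable names `alias` and `clone` are made up for illustration.

```python
x = ["a", "b", "c"]

alias = x      # a second name for the same list object
clone = x[:]   # a new list with the same elements (a shallow copy)

alias[0] = "z"   # the change is visible through x as well
clone[1] = "y"   # the change stays in the copy

print(x)      # ['z', 'b', 'c']
print(clone)  # ['a', 'y', 'c']
```

list(x) and x.copy() produce the same kind of shallow copy as x[:].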

13
Q

what are the 3 parameters of np.random.normal()

A

distribution mean
distribution standard deviation
number of samples

14
Q

how to check if "x" is a key in dictionary y

A

"x" in y

15
Q

what is pandas

A

high level data manipulation tool built on numpy

16
Q

suppose brics is a dataframe. what is the difference between brics["country"] and brics[["country"]]

A

the first returns a pandas Series: the country values together with their index

the second returns a dataframe with one column, countries
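A small sketch of the two bracket forms. The contents of `brics` here are a hypothetical stand-in for the real data set.

```python
import pandas as pd

# Hypothetical stand-in for the brics DataFrame used on these cards.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"],
     "capital": ["Brasilia", "Moscow", "New Delhi"]},
    index=["BR", "RU", "IN"],
)

single = brics["country"]     # single brackets: pandas Series
double = brics[["country"]]   # double brackets: one-column DataFrame

print(type(single).__name__, type(double).__name__)
```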

17
Q

what is the type of brics[1:4] considering brics is a dataframe

A

dataframe

18
Q

what is the difference between df.loc['', ''] and df.iloc[rowint, colint]

A

loc selects by row and column labels, while iloc selects by integer positions
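A minimal sketch of label-based versus position-based selection, using a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical data; the index labels are the row "keys" that loc uses.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"],
     "population": [200.4, 143.5, 1252.0]},
    index=["BR", "RU", "IN"],
)

by_label = brics.loc["RU", "country"]   # row label, column label
by_position = brics.iloc[1, 0]          # row position 1, column position 0

print(by_label, by_position)
```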

19
Q

how to use logical operators with numpy

A

np.logical_and()
np.logical_or()
np.logical_not()
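A brief sketch of the three functions combining boolean arrays element by element. The heights are made-up values.

```python
import numpy as np

heights = np.array([1.60, 1.75, 1.82, 1.95])  # hypothetical data

tall = heights > 1.70
very_tall = heights > 1.90

# Tall but not very tall.
mid_range = np.logical_and(tall, np.logical_not(very_tall))
# Either short or very tall.
extremes = np.logical_or(heights < 1.65, very_tall)

print(heights[mid_range])
print(heights[extremes])
```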

20
Q

create a for loop that loops through a list and prints the index and its value

A

for index, height in enumerate(fam):
    print(index, height)

21
Q

loop over the contents of a dictionary

A

for key, value in worlds.items():
    print(key, value)

22
Q

how to loop through a dataframe printing index and row content

A

for index, row in brics.iterrows():
    print(index)
    print(row)  # row is a pandas Series in this case

23
Q

what does the following do

brics["country"].apply(len)

A

returns a Series with the length of the country value in each row; it only becomes a new column of the dataframe if the result is assigned to one
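A short sketch of apply(len) on a hypothetical frame. The column name `name_length` is made up for illustration.

```python
import pandas as pd

brics = pd.DataFrame({"country": ["Brazil", "Russia", "India"]})  # hypothetical data

lengths = brics["country"].apply(len)   # a new Series; brics itself is unchanged
columns_before = list(brics.columns)

brics["name_length"] = lengths          # the assignment is what adds the column
print(brics)
```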

24
Q

sort a dataframe by multiple values in ascending and descending order

A

df.sort_values(["col1", "col2"], ascending=[True, False])
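A small sketch of the call on made-up data: col1 is sorted ascending, and ties within col1 are broken by col2 descending.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["b", "a", "a", "b"], "col2": [1, 2, 3, 4]})  # hypothetical data

sorted_df = df.sort_values(["col1", "col2"], ascending=[True, False])
print(sorted_df)
```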

25
how to subset a dataframe to match 2 conditions
h[cond1 & cond2] keeps rows that match both conditions; h[cond1 | cond2] keeps rows that match either one
26
how to return all rows where the value in column "state" in a dataframe is one of 3 predetermined values
h1 = h[h['state'].isin(['north', 'virginia', 'arizona'])]
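A sketch of both subsetting cards on a hypothetical frame `h`. The `price` column and the thresholds are assumptions. Conditions written inline need their own parentheses, because & and | bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical data.
h = pd.DataFrame(
    {"state": ["north", "virginia", "ohio", "arizona"],
     "price": [100, 250, 300, 150]}
)

cond1 = h["price"] > 120
cond2 = h["state"] != "ohio"

both = h[cond1 & cond2]                                    # rows matching both conditions
either = h[(h["price"] > 280) | (h["state"] == "north")]   # rows matching at least one
h1 = h[h["state"].isin(["north", "virginia", "arizona"])]  # membership in a list of values

print(both, either, h1, sep="\n\n")
```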
27
how to plot a dataframe
df.plot(x=xcol, y=ycol, kind=kind, title=title) followed by plt.show(); in case of a histogram, df[col].hist()
28
how to find how many nulls in every column in a dataframe
df.isna().sum() (can be plotted using .plot(kind='bar'))
29
how to replace null values with a default value
df.fillna(0)
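A minimal sketch of counting and filling nulls, covering this card and the previous one. The frame is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})  # hypothetical data

null_counts = df.isna().sum()   # Series with one null count per column
filled = df.fillna(0)           # returns a new DataFrame; df keeps its nulls

print(null_counts)
print(filled)
```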
30
what is statistics
the practice and study of collecting and analyzing data to derive a fact or a summary
31
what are the types of statistics
descriptive statistics: describe and summarize data
inferential statistics: use a sample of data to make inferences about a larger population
32
what are the types of data
numeric (quantitative): continuous (measured), discrete (counted)
categorical (qualitative): nominal (unordered), ordinal (ordered) : strongly agree>
33
what is the difference between mean and median and mode
mean (average): sum of the values divided by the number of samples
median: the middle value, with 50% of the data below it and 50% above it
mode: the most frequent value in the data
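A quick sketch with Python's statistics module on a made-up sample. The single large value pulls the mean up while the median and mode stay put.

```python
import statistics

values = [1, 2, 2, 3, 12]  # hypothetical sample with one large value

mean = statistics.mean(values)      # (1 + 2 + 2 + 3 + 12) / 5
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```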
34
what is a left skewed histogram and right skewed
a left skewed histogram has its long tail on the left, so most values are concentrated at the high end and the mean is pulled below the median (right skewed is the opposite)
35
write one line of code that groups a dataframe by country and computes the mean and median of consumption
df.groupby('country')['consumption'].agg([np.mean, np.median])
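A sketch of the same aggregation on made-up data. It passes the function names as strings, which pandas also accepts and which avoids the deprecation warning recent pandas versions emit for np.mean.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {"country": ["A", "A", "B", "B", "B"],
     "consumption": [1.0, 3.0, 2.0, 4.0, 9.0]}
)

summary = df.groupby("country")["consumption"].agg(["mean", "median"])
print(summary)
```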
36
what are the two methods to calculate standard deviation
np.sqrt(np.var(df['col'], ddof=1))
np.std(df['col'], ddof=1)
note that ddof=1 is used when the data is a sample (divide by n - 1), and ddof=0 when the data is the entire population (divide by n)
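A small sketch comparing the two ddof settings on a made-up array. Dividing by n - 1 always gives the slightly larger value.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

population_std = np.std(data, ddof=0)            # divide by n
sample_std = np.std(data, ddof=1)                # divide by n - 1
sample_std_alt = np.sqrt(np.var(data, ddof=1))   # same value via the variance

print(population_std, sample_std, sample_std_alt)
```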
37
what are quantiles (percentiles)
split the data into some number of equal parts
np.quantile(df[col], 0.5) (0.5 gives the median)
38
what is IQR
interquartile range: another measure of spread, it's the distance between the 25th and 75th percentile
39
what is an outlier
a data point that differs greatly from the others
a data point is an outlier if: data < Q1 - 1.5 x IQR or data > Q3 + 1.5 x IQR
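A sketch of the IQR rule on made-up numbers, using np.quantile for Q1 and Q3.

```python
import numpy as np

data = np.array([10, 12, 12, 13, 14, 15, 16, 40])  # hypothetical data

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1                        # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)
```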
40
what is a plot that visualizes outliers
boxplot
41
how to check if arrays A and B are equal?
np.array_equal(A, B) for exact equality; np.allclose(A, B) for equality within a floating-point tolerance
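A short sketch of why the tolerance matters: 0.1 + 0.2 is not exactly 0.3 in floating point, so an exact comparison fails while np.allclose succeeds.

```python
import numpy as np

A = np.array([0.1 + 0.2, 1.0])
B = np.array([0.3, 1.0])

exact = np.array_equal(A, B)   # exact element-by-element comparison
close = np.allclose(A, B)      # comparison within a small tolerance

print(exact, close)
```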
42
how to select a random entry from a dataframe
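df.sample() returns one random row and df.sample(n) returns n rows; sampling is without replacement by default (replace=False), so the same row cannot be drawn twice. the original dataframe is not modified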
43
what does np.random.seed(5) do
sets the seed of the pseudorandom number generator so that we get the same random numbers on every run of the code
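A minimal sketch showing that resetting the same seed reproduces the same draws. It uses np.random.normal with the three parameters from the earlier card (mean, standard deviation, number of samples).

```python
import numpy as np

np.random.seed(5)
first = np.random.normal(0, 1, 3)    # mean, standard deviation, number of samples

np.random.seed(5)                    # reset to the same seed
second = np.random.normal(0, 1, 3)   # identical to the first draw

print(first, second)
```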
44
how to generate a random number using scipy.stats
uniform.rvs(loc=start, scale=end - start, size=arraySize); in scipy.stats, scale is the width of the interval, not its end point
45
how to get the probability of a continuous distribution function
uniform.cdf(x, loc=start, scale=end - start) returns P(X <= x), the probability of a value up to x
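A sketch of scipy.stats.uniform on a hypothetical interval from 0 to 30. The `random_state` argument is only there to make the draw repeatable.

```python
from scipy.stats import uniform

start, end = 0, 30  # hypothetical interval

# loc is the lower bound and scale is the width of the interval.
samples = uniform.rvs(loc=start, scale=end - start, size=5, random_state=5)
p_at_most_7 = uniform.cdf(7, loc=start, scale=end - start)   # P(X <= 7)

print(samples, p_at_most_7)
```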
46
when does the binomial distribution fail to apply
when the trials are not independent, the binomial distribution does not apply
47
what is the inverse of norm.cdf(value, mean, std)
norm.ppf(probability, mean, std); note that the third argument is the standard deviation, not the variance
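A round-trip sketch with made-up parameters: norm.cdf turns a value into a cumulative probability, and norm.ppf turns that probability back into the value.

```python
from scipy.stats import norm

mean, std = 161, 7  # hypothetical values; the third argument is the standard deviation

p = norm.cdf(154, mean, std)   # P(X <= 154), one standard deviation below the mean
x = norm.ppf(p, mean, std)     # value whose cumulative probability is p

print(p, x)
```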
48
what is the difference between pmf and cdf
pmf: probability of exactly x (discrete distributions)
cdf: probability of a value up to and including x
49
what is correlation
a number in [-1, 1] that quantifies the linear relationship between x and y; the closer it is to 0, the weaker the relationship
df[x].corr(df[y])
a strong correlation suggests a linear regression could model the relationship
50
how to visualize the linear regression model
import seaborn as sns
import matplotlib.pyplot as plt
sns.lmplot(x=x, y=y, data=data, ci=ci)
plt.show()
51
what are pivot tables
tables that summarize an original table by group
df.pivot_table(values=, index=, aggfunc=[np.mean, np.median])
aggfunc is optional and defaults to the mean
columns=boolcol can also be added to compute the statistic separately for each value of that bool column
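A sketch of pivot_table on made-up sales data. The column names `sales` and `day_type` are hypothetical, and a string column stands in for the boolean one to keep the lookup simple.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame(
    {"country": ["A", "A", "B", "B"],
     "day_type": ["weekend", "weekday", "weekend", "weekday"],
     "sales": [10.0, 20.0, 30.0, 50.0]}
)

# aggfunc is omitted, so the mean is used.
by_country = df.pivot_table(values="sales", index="country")
# columns= splits each country's mean by day_type.
by_both = df.pivot_table(values="sales", index="country", columns="day_type")

print(by_country)
print(by_both)
```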
52
what are the requirements of supervised learning
no missing values
data in numeric format
data stored in a pandas dataframe or numpy array
53
what is k nearest neighbors
predict the label of any data point by looking at the k closest labeled data points and getting them to vote on what label the unlabeled observation should have
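A from-scratch sketch of the voting idea, assuming Euclidean distance. The function `knn_predict` and the data are hypothetical and this is not how any particular library implements it.

```python
from collections import Counter
import math

def knn_predict(points, labels, query, k):
    """Return the majority label among the k labeled points closest to query."""
    # Indices of the labeled points, ordered from nearest to farthest.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (2, 2), k=3))
```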
54
what happens when our selected k in kNN is too high
high k causes underfitting low k causes overfitting
55
what happens if a and b are too high in linear regression
large coefficients can lead to overfitting
when alpha in ridge regression is too high, we get underfitting
56
when is lasso regression used
it can be used to select important features: it shrinks the coefficients of less important features to zero
57
is accuracy always a good measure? what can replace it
no, it is not a good measure when classes are imbalanced. we can use a confusion matrix, and metrics derived from it such as precision and recall, instead
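A sketch of why accuracy misleads on imbalanced classes, with made-up labels: a model that always predicts the majority class scores 95% accuracy yet finds none of the positives.

```python
# Hypothetical labels: 95 negatives and 5 positives.
actual = [0] * 95 + [1] * 5
predicted = [0] * 100  # a model that always predicts the majority class

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Confusion-matrix counts for the positive class.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
recall = tp / (tp + fn)

print(accuracy, recall)
```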
58
what are hyper parameters
parameters that we specify before fitting a model, like alpha and n_neighbors
59
how do you achieve hyper parameter tuning
1- Try lots of different hyperparameter values
2- Fit all of them separately
3- See how well they perform
4- Choose the best performing values
60
why do we use cross-validation when fitting different hyperparameters
to avoid overfitting the hyperparameters to the test set
61
when do we use standardization and when do we use normalization. how do we do them
standardization: rescale to mean = 0 and std = 1. we use it when the features roughly follow a gaussian (normal) distribution; common for linear and logistic regression or neural networks
normalization: rescale to a fixed range, usually [0, 1]. we use it when we know the data does not follow a gaussian (normal) distribution
both are linear rescalings, so the shape of the original distribution is maintained
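A sketch of both rescalings done by hand with NumPy on a made-up feature column. scikit-learn's StandardScaler and MinMaxScaler apply the same formulas column by column.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # hypothetical feature column

standardized = (x - x.mean()) / x.std()            # mean 0, standard deviation 1
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled to the range [0, 1]

print(standardized)
print(normalized)
```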
62
what are L1 and L2 regularization
L1 (lasso) and L2 (ridge) are regularization penalties
LogisticRegression(solver='liblinear', penalty='l1'); penalty='l2' is the default
63
when is SVM used
widely used for classification problems but can be employed in regression problems
64
what is a kernel trick
svm uses this trick to implicitly map non-separable datasets to a higher dimension, where they become linearly separable.
65
what are the KPIs to evaluate models
size of the dataset: fewer features = simpler model and faster training time
interpretability: easier to explain, which is important for stakeholders (like linear and logistic regression)
flexibility: improve accuracy by making fewer assumptions (like KNN)
metrics: RMSE, R-squared, accuracy, precision, recall
66
what is inertia
how spread out the samples within each k-means cluster are (lower is better). it measures the quality of a k-means clustering when we don't have prelabeled clusters
model.inertia_
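A by-hand sketch of the quantity, assuming the usual definition: the sum of squared distances from each sample to its nearest cluster center. The samples and centers are made up; a fitted scikit-learn KMeans model reports this value as inertia_.

```python
import numpy as np

# Hypothetical samples and cluster centers.
samples = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]])
centroids = np.array([[1.0, 2.0], [10.0, 9.0]])

# Squared distance from every sample to every centroid, shape (4, 2).
sq_dist = ((samples[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# Each sample contributes the squared distance to its nearest centroid only.
inertia = sq_dist.min(axis=1).sum()
print(inertia)
```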
67