152 - 200 Flashcards

1
Q

numpy.info(object=None, maxwidth=76, output=None, toplevel='numpy')

A

Get help information for a function, class, or module.

np.info(np.polyval) 
   polyval(p, x)
     Evaluate the polynomial p at x.
np.info('fft') 
     *** Found in numpy ***
Core FFT routines

     *** Found in numpy.fft ***
 fft(a, n=None, axis=-1)

     *** Repeat reference found in numpy.fft.fftpack ***
     *** Total of 3 references found. ***
2
Q

numpy.ndarray.view([dtype][, type])

A

Returns a new view of the array with the same data.

a = np.arange(10, dtype='int16')
print("a is: \n", a)

👉 [0 1 2 3 4 5 6 7 8 9]
v = a.view('int32')
print("\n After using view() with dtype = 'int32' a is : \n", a)

After using view() with dtype = 'int32' a is : 
👉 [0 1 2 3 4 5 6 7 8 9]
v += 1
print("\n After using view() with dtype = 'int32' and adding 1 a is : \n", a)

After using view() with dtype = 'int32' and adding 1 a is : 
👉 [1 1 3 3 5 5 7 7 9 9]
3
Q

numpy.r_

A

Translates slice objects to concatenation along the first axis. This is a simple way to build up arrays quickly.

np.r_['r',[1,2,3], [4,5,6]]
matrix([[1, 2, 3, 4, 5, 6]])
np.r_['0,2,0', [1,2,3], [4,5,6]]
👉 array([[1],[2],[3],[4],[5],[6]])
np.r_['1,2,0', [1,2,3], [4,5,6]]
👉 array([[1, 4],
           [2, 5],
           [3, 6]])
a = np.array([[0, 1, 2], [3, 4, 5]])
np.r_['-1', a, a] # concatenate along last axis
👉 array([[0, 1, 2, 0, 1, 2],
           [3, 4, 5, 3, 4, 5]])
np.r_['0,2', [1,2,3], [4,5,6]] # concatenate along first axis, dim>=2
👉 array([[1, 2, 3],
           [4, 5, 6]])
4
Q

numpy.c_

A

Translates slice objects to concatenation along the second axis.

np.c_[np.array([1,2,3]), np.array([4,5,6])]
👉 array([[1, 4],
          [2, 5],
          [3, 6]])
np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
👉 array([[1, 2, 3, 0, 0, 4, 5, 6]])
5
Q

pandas.DataFrame.sum(axis=None, skipna=True, level=None, numeric_only=None, min_count=0, **kwargs)

A

Return the sum of the values over the requested axis.

idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)

s.sum()
👉 14
6
Q

pandas.DataFrame.dtypes

A

Return the dtypes in the DataFrame. This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns.

df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
7
Q

pandas.DataFrame.isna()

A

Detect missing values. Return a boolean same-sized object indicating if the values are NA.

df = pd.DataFrame(dict(age=[5, 6, np.NaN],
                   born=[pd.NaT, pd.Timestamp('1939-05-27'), 
                   pd.Timestamp('1940-04-25')],
                   name=['Alfred', 'Batman', ''],
                   toy=[None, 'Batmobile', 'Joker']))

df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
8
Q

statsmodels.formula.api.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)

A

Performs linear regression by ordinary least squares from an R-style formula; the fit() method then fits the model to the data.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('headbrain1.csv')

df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()

#model summary
print(model.summary())
9
Q

statsmodels.api.OLS(y, x)

A

statsmodels.api.OLS(y, x): the linear regression method itself.
statsmodels.api.add_constant: adds a constant column so the model has an intercept (baseline) term to build from.

  • y: the dependent variable (depends on x)
  • x: the independent variable

.fit() | .summary() | .params | .predict()

import statsmodels.api as sm
data = pd.read_csv('train.csv')

x = data['x'].tolist()
y = data['y'].tolist()

x = sm.add_constant(x)
result = sm.OLS(y, x).fit()
print(result.summary())
10
Q

Multicollinearity

A

Occurs when two or more independent variables in a multiple regression model are highly correlated with one another. When some features are highly correlated, we may have difficulty distinguishing their individual effects on the dependent variable.
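A common way to quantify it is the variance inflation factor (VIF); a minimal sketch with statsmodels, using hypothetical columns where 'b' nearly duplicates 'a':

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: 'b' is almost a linear function of 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [2, 4, 6, 8, 11],
                   'c': [5, 3, 8, 1, 2]})
X = sm.add_constant(df)  # VIF is computed on the design matrix with an intercept

# A VIF above roughly 5-10 is commonly read as a sign of multicollinearity
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))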

11
Q

statsmodels.api.qqplot (Quantile-Quantile Plot)

A

The plot compares the quantiles of two distributions (a sample against a theoretical distribution, or two samples) and summarizes whether they are similar in location and shape.

import numpy as np
import statsmodels.api as sm
import pylab as py

data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='45')  # 45-degree reference line
py.show()
12
Q

seaborn.load_dataset(name, cache=True, data_home=None, **kws)

A

Load an example dataset from the online repository (requires internet).
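For example, loading the built-in "tips" dataset:

import seaborn as sns

tips = sns.load_dataset('tips')  # fetched from the seaborn-data GitHub repository
print(tips.head())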

13
Q

pandas_profiling.ProfileReport(df, **kwargs)

A

Generates a basic exploratory report on the input DataFrame.

import pandas as pd
import pandas_profiling as pp

data = pd.DataFrame(dict)  # `dict` here is a column-name -> values mapping defined earlier
print(data)

# Form the ProfileReport and save it as an output.html file
profile = pp.ProfileReport(data)
profile.to_file("output.html")
14
Q

statsmodels.api.Logit()

A

Builds the model and fits the data.

Function for performing logistic regression. The Logit() function accepts y and X as parameters and returns the Logit object. Logistic regression is the type of regression analysis used to find the probability of a certain event occurring.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('logit_train1.csv', index_col = 0)
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
log_reg = sm.Logit(ytrain, Xtrain).fit()
print(log_reg.summary())
15
Q

class.__dict__

A

Contains all the attributes of the class: a dictionary or other mapping object used to store an object's (writable) attributes.

class Shape(object):
    def __init__(self, **kwargs):
        self.__dict__.update(**kwargs)

class Circle(Shape):
    def __init__(self, **kwargs):
        super(Circle, self).__init__(**kwargs)
16
Q

pandas.DataFrame.pivot(index=None, columns=None, values=None)

A

Return reshaped DataFrame organized by given index/column values. In short, it reshapes long-format data into a wide table keyed by the chosen index and columns.

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6
df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
          baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t
17
Q

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

A

Used to separate the array of elements into different bins. The cut function is mainly used to perform statistical analysis on scalar data.

pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
df = pd.DataFrame({'number': np.random.randint(1, 100, 10)})
df['bins'] = pd.cut(x=df['number'], bins=[1, 20, 40, 60, 80, 100])
18
Q

numpy.ones_like(a, dtype=None, order='K', subok=True, shape=None)

A

Return an array of ones with the same shape and type as a given array.

x = np.array([[0, 1, 2], [3, 4, 5]])

np.ones_like(x)
👉 array([[1, 1, 1], [1, 1, 1]])
y = np.arange(3, dtype=float)
👉 array([0., 1., 2.])

np.ones_like(y)
👉 array([1.,  1.,  1.])
19
Q

pandas.DataFrame.corrwith(other, axis=0, drop=False, method='pearson')

A

Compute pairwise correlation. Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.

df1 = pd.DataFrame({"A":[1, 5, 7, 8], "B":[5, 8, 4, 3], "C":[10, 4, 9, 3]})
df2 = pd.DataFrame({"A":[5, 3, 6, 4], "B":[11, 2, 4, 3], "C":[4, 3, 8, 5]})

#To find the correlation among the columns of df1 and df2 along the column axis
df1.corrwith(df2, axis = 0)
20
Q

pandas.DataFrame.corr(method='pearson', min_periods=1)

A

The method finds the correlation of each column in a DataFrame.

data = {
  "Duration": [50, 40, 45],
  "Pulse": [109, 117, 110],
  "Calories": [409.1, 479.5, 340.8]  
}
df = pd.DataFrame(data)
print(df.corr())

          Duration     Pulse  Calories
Duration  1.000000 -0.917663 -0.507551
Pulse    -0.917663  1.000000  0.808134
Calories -0.507551  0.808134  1.000000
21
Q

Classification

A

Predicting a discrete category: will it be Cold or Hot tomorrow?

22
Q

Regression

A

Predicting a continuous value: what is the temperature going to be tomorrow?

23
Q

pandas.DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

A

Drop specified labels from rows or columns. Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.

  • inplace - if True, modifies the DataFrame in place instead of returning a copy
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

data.drop(columns='WallMat', inplace=True) # Drop WallMat column
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
24
Q

sklearn.datasets.make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

A

Generate a random regression problem. The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

from sklearn.datasets import make_regression

x, y = make_regression(n_samples = 100, n_features = 1, n_informative = 1, noise = 10, random_state = 42)
25
Q

Error analysis

A

An iterative process for identifying common themes within our model’s mistakes.

  • Do specific cohorts within our X data perform better or worse than others?
  • Does a class consistently perform better or worse than another?
  • Are some errors so large they drag overall performance down?
  • These questions can lead you to more data collection or enhanced feature engineering (see the sketch below).
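A minimal sketch of cohort-level error analysis with pandas, using hypothetical y_true/y_pred values and a hypothetical 'cohort' column:

import pandas as pd

results = pd.DataFrame({'cohort': ['a', 'a', 'b', 'b'],
                        'y_true': [10.0, 12.0, 9.0, 20.0],
                        'y_pred': [11.0, 12.5, 15.0, 19.0]})
results['abs_error'] = (results['y_true'] - results['y_pred']).abs()

# Mean error per cohort reveals under-performing segments ('b' here)
print(results.groupby('cohort')['abs_error'].mean())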
26
Q

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

A

Compute a simple cross tabulation (a quantitative analysis of the relationship between multiple variables in a table) of two (or more) factors.

By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

y_test = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1] # actual truths
preds = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # predictions

results_df = pd.DataFrame({"actual": y_test, "predicted": preds}) 
confusion_matrix = pd.crosstab(index= results_df['actual'], columns = results_df['predicted'])

predicted  0  1
actual
0          3  2
1          1  4
27
Q

sklearn.dummy.DummyRegressor(*, strategy='mean', constant=None, quantile=None)

A

Gives predictions based on simple strategies without paying any attention to the input Data.

from sklearn.dummy import DummyRegressor

baseline_model = DummyRegressor(strategy="mean")  # Baseline
baseline_model.fit(X_train, y_train)  # Calculate value for strategy
baseline_model.score(X_test, y_test)
28
Q

Evaluation metrics

A

Used to measure how well a machine learning model can perform a task.
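For instance, scikit-learn ships standard metrics for both task types:

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: fraction of correct predictions
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# Regression: average squared distance from the truth
print(mean_squared_error([2.5, 0.0, 2.0], [3.0, 0.5, 2.0]))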

29
Q

Feature Selection

A

Process of eliminating non-informative features.

Feature Correlation (univariate)
Feature Permutation (multivariate)

Why feature selection?
- Garbage in, garbage out
- The curse of dimensionality
- Reducing complexity

corr = data.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap= "YlGnBu");
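The heatmap above covers the univariate (correlation) route; for feature permutation, a minimal sketch with scikit-learn, assuming an already fitted model and held-out X_test/y_test:

from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and measure how much the score drops
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)  # bigger drop = more informative feature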


30
Q

Feature creation

A

Introduces our domain knowledge into a dataset to provide more signal for our models to learn from.

Why create new features?
- Create additional information
- Potentially improve model performance

Examples of creating new features
- bedroom to total_room ratio
- weight divided by height squared for Body Mass Index
- delivered_date - dispatch_date for the lag time between events
- Categorize the date as either weekday or weekend
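A sketch of the examples above in pandas (hypothetical column names and values):

import pandas as pd

df = pd.DataFrame({'bedrooms': [3], 'total_rooms': [8],
                   'weight': [70.0], 'height': [1.75],
                   'dispatch_date': pd.to_datetime(['2023-01-02']),
                   'delivered_date': pd.to_datetime(['2023-01-05'])})

df['bedroom_ratio'] = df['bedrooms'] / df['total_rooms']
df['bmi'] = df['weight'] / df['height'] ** 2                    # kg / m^2
df['lag_days'] = (df['delivered_date'] - df['dispatch_date']).dt.days
df['is_weekend'] = df['dispatch_date'].dt.dayofweek >= 5        # Sat/Sun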

31
Q

Discretizing

A

Process of turning continuous data into discrete data using bins. It can turn a regression task into a classification task and serve as a form of feature engineering.

Example: turn the dataset into a classification task, Cheap or Expensive, according to the mean.

data['SalePriceBinary'] = pd.cut(x = data['SalePrice'],
                   bins=[data['SalePrice'].min()-1,
                   data['SalePrice'].mean(),
                   data['SalePrice'].max()+1], 
                   labels=['cheap', 'expensive'])
32
Q

Encoding

A

Consists of transforming non-numerical data into an equivalent numerical form.

Why encoding?
- Data may be represented as words, letters, or symbols
- Most Machine Learning algorithms only process numerical data
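A minimal sketch of one-hot encoding with pandas:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})
print(pd.get_dummies(df, columns=['color']))  # one indicator column per category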

33
Q

Dataset balancing

A

Balancing generates higher-accuracy models, higher balanced accuracy, and a balanced detection rate. In a classification dataset, the number of data points representing each class is often unequal.

Why balancing?
- ML algorithms learn by example
- Will tend to predict the under-represented class poorly
- ~30:70 split for binary classification would be considered imbalanced

Balancing strategies
- Over-sampling of minority class
- Under-sampling of the majority class
- Computation of new minority class instances
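A sketch of naive over-sampling with plain pandas (hypothetical toy data; libraries like imbalanced-learn offer more sophisticated strategies):

import pandas as pd

df = pd.DataFrame({'x': range(10), 'label': [0]*7 + [1]*3})  # 70:30 imbalance
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Sample the minority class with replacement until the classes match
balanced = pd.concat([majority,
                      minority.sample(len(majority), replace=True, random_state=42)])
print(balanced['label'].value_counts())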

34
Q

pandas.DataFrame.replace(to_replace=None, value=NoDefault.no_default, inplace=False, limit=None, regex=False, method=NoDefault.no_default)

A

Replace values given in to_replace with value. Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

data.Alley.replace(np.nan, "NoAlley", inplace=True) #Replace NaN by "NoAlley"
s = pd.Series([1, 2, 3, 4, 5])
s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
df.replace(to_replace="Boston Celtics", value="Omega Warrior")
# Replace missing Pesos values with the column mean
data.Pesos.replace(np.nan, data.Pesos.mean())
35
Q

Feature Scaling

A

Transforming continuous features into a common, smaller range.

Why scaling?
- Features with large magnitudes can incorrectly outweigh features of small magnitudes
- Scaling to smaller magnitudes improves computational efficiency
- Increases interpretability of feature coefficients
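A minimal sketch with scikit-learn's MinMaxScaler (StandardScaler is used the same way):

from sklearn.preprocessing import MinMaxScaler

X = [[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]]  # features on very different scales
scaler = MinMaxScaler()                          # rescales each feature to [0, 1]
print(scaler.fit_transform(X))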

36
Q

Outliers

A

Data points that deviate from the rest of the data.

Common reasons for outliers
- Data entry errors
- Measurement errors
- Data manipulation and preprocessing errors
- Novelties (not errors)

Outliers affect:
- Dataset distributions and patterns
- Central tendency metrics e.g. mean and standard deviation
- Machine learning models’ performances

Handling Outliers
- Is the outlier evidently false?
- Could it be a novelty?
- Could it be used as a feature?

See Outliers
data[['GrLivArea']].boxplot()
Dropping Outliers

false_observation = data['GrLivArea'].argmin() # Get index corresponding to minimum value
data = data.drop(false_observation).reset_index(drop=True) # Drop row
37
Q

pandas.isnull()

A

Detect missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are missing.

(data.WallMat.isnull().sum()/len(data))*100 # Percentage of missing values
data.isnull().sum().sort_values(ascending=False)
pd.isna('dog')
👉 False
pd.isna(pd.NA)
👉 True
38
Q

pandas.DataFrame.duplicated(subset=None, keep='first')

A

Return boolean Series denoting duplicate rows. Considering certain columns is optional.

df = pd.DataFrame({
       'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
       'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
       'rating': [4, 4, 3.5, 15, 5]
})

df.duplicated()
0    False
1    True
2    False
3    False
4    False
39
Q

pandas.Series.argmin and pandas.Series.argmax(axis=None, skipna=True, *args, **kwargs)

A

Return the int position of the smallest (argmin) or largest (argmax) value in the Series. If the minimum (or maximum) is achieved in multiple locations, the first row position is returned.

s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

s.argmax()
👉 2

s.argmin()
👉 0
40
Q

False Negative

A

Incorrectly labeling a sample as negative when it is actually positive.

👉 Predicted ❌
👉 Actual ✅

41
Q

False Positive

A

Incorrectly labeling a sample as positive when it is actually negative.

👉 Predicted ✅
👉 Actual ❌

42
Q

True Negative

A

Correctly identifying a negative sample.

👉 Predicted ❌
👉 Actual ❌

43
Q

True Positive

A

Correctly identifying a positive sample.

👉 Predicted ✅
👉 Actual ✅

44
Q

seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)

A

Plot rectangular data as a color-encoded matrix.

sns.heatmap(pd.DataFrame(X).corr(), cmap='coolwarm')
plt.figure(figsize=(10,7))
mask = np.triu(np.ones_like(orders.corr(), dtype=bool))
sns.heatmap(orders.corr(), annot=True, mask=mask)  # mask hides the redundant upper triangle
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data)
45
Q

pandas.DataFrame.select_dtypes(include=None, exclude=None)

A

Return a subset of the DataFrame’s columns based on the column types.

df.select_dtypes(include=['float64'])
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
df.select_dtypes(exclude=['int64'])
       b       c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
46
Q

pandas.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True, **kwargs)

A

The immutable sequence is used for indexing and alignment. The basic object storing axis labels for all pandas objects.

pd.Index([1, 2, 3])
Int64Index([1, 2, 3], dtype='int64')
pd.Index(list('abc'))
Index(['a', 'b', 'c'], dtype='object')
47
Q

sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

A

Randomized search on hyperparameters. RandomizedSearchCV is essentially the same as GridSearchCV, but it replaces GridSearchCV's exhaustive grid search with random sampling from the parameter space.

Use it when the number of parameters to consider is especially large and their effects are of unbalanced magnitude.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]}  # example search space
rnd_search = RandomizedSearchCV(RandomForestClassifier(), param, n_iter=10, cv=9)
rnd_search.fit(X, y)
rnd_search.best_params_
48
Q

pandas.DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)

A

Make a histogram of the DataFrame’s columns. A histogram is a representation of the distribution of data.

data[['GrLivArea']].plot.hist(bins=20)
df = pd.DataFrame({
     'length': [1.5, 0.5, 1.2, 0.9, 3],
     'width': [0.7, 0.2, 0.15, 0.2, 1.1]
     }, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
hist = df.hist(bins=3)
49
Q

threshold

A

One of the most basic techniques for segmentation: pixel values are compared against a cutoff, yielding segments that each represent something. For example, more complex segmentation algorithms might be able to segment out "house-like" structures in an image.
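A minimal sketch of threshold-based segmentation with NumPy, assuming a grayscale image stored as a 2-D array of pixel intensities:

import numpy as np

image = np.array([[10, 200, 30],
                  [220, 40, 250]])  # hypothetical pixel values in 0-255
mask = image > 128                  # True where a pixel passes the cutoff
print(mask.astype(int))             # binary segments: foreground vs. background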