152 - 200 Flashcards

1
Q

numpy.info(object=None, maxwidth=76, output=None, toplevel='numpy')

A

Get help information for a function, class, or module.

np.info(np.polyval) 
   polyval(p, x)
     Evaluate the polynomial p at x.
np.info('fft') 
     *** Found in numpy ***
Core FFT routines

     *** Found in numpy.fft ***
 fft(a, n=None, axis=-1)

     *** Repeat reference found in numpy.fft.fftpack ***
     *** Total of 3 references found. ***
2
Q

numpy.ndarray.view([dtype][, type])

A

Returns a new view of the array with the same data.

a = np.arange(10, dtype='int16')
print("a is: \n", a)

👉 [0 1 2 3 4 5 6 7 8 9]
v = a.view('int32')
print("\n After using view() with dtype = 'int32' a is : \n", a)

After using view() with dtype = 'int32' a is : 
👉 [0 1 2 3 4 5 6 7 8 9]
v += 1
print("\n After using view() with dtype = 'int32' and adding 1 a is : \n", a)

After using view() with dtype = 'int32' and adding 1 a is : 
👉 [1 1 3 3 5 5 7 7 9 9]
3
Q

numpy.r_

A

Translates slice objects to concatenation along the first axis. This is a simple way to build up arrays quickly.

np.r_['r',[1,2,3], [4,5,6]]
matrix([[1, 2, 3, 4, 5, 6]])
np.r_['0,2,0', [1,2,3], [4,5,6]]
👉 array([[1],[2],[3],[4],[5],[6]])
np.r_['1,2,0', [1,2,3], [4,5,6]]
👉 array([[1, 4],
           [2, 5],
           [3, 6]])
a = np.array([[0, 1, 2], [3, 4, 5]])
np.r_['-1', a, a] # concatenate along last axis
👉 array([[0, 1, 2, 0, 1, 2],
           [3, 4, 5, 3, 4, 5]])
np.r_['0,2', [1,2,3], [4,5,6]] # concatenate along first axis, dim>=2
👉 array([[1, 2, 3],
           [4, 5, 6]])
4
Q

numpy.c_

A

Translates slice objects to concatenation along the second axis.

np.c_[np.array([1,2,3]), np.array([4,5,6])]
👉 array([[1, 4],
          [2, 5],
          [3, 6]])
np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
👉 array([[1, 2, 3, 0, 0, 4, 5, 6]])
5
Q

pandas.DataFrame.sum(axis=None, skipna=True, level=None, numeric_only=None, min_count=0, **kwargs)

A

Return the sum of the values over the requested axis.

idx = pd.MultiIndex.from_arrays([
    ['warm', 'warm', 'cold', 'cold'],
    ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'])
s = pd.Series([4, 2, 0, 8], name='legs', index=idx)

s.sum()
👉 14
6
Q

pandas.DataFrame.dtypes

A

Return the dtypes in the DataFrame. This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns.

df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
7
Q

pandas.DataFrame.isna()

A

Detect missing values. Return a boolean same-sized object indicating if the values are NA.

df = pd.DataFrame(dict(age=[5, 6, np.NaN],
                   born=[pd.NaT, pd.Timestamp('1939-05-27'), 
                   pd.Timestamp('1940-04-25')],
                   name=['Alfred', 'Batman', ''],
                   toy=[None, 'Batmobile', 'Joker']))

df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
8
Q

statsmodels.formula.api.ols(formula, data, subset=None, drop_cols=None, *args, **kwargs)

A

Performs linear regression by ordinary least squares from an R-style formula; the fit() method then fits the model to the data.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv('headbrain1.csv')

df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()

#model summary
print(model.summary())
9
Q

statsmodels.api.OLS(y, x)

A

statsmodels.api.OLS(y, x): the linear regression method itself.
statsmodels.api.add_constant: adds a constant column so the model has an intercept (baseline) term to build from.

  • y: the dependent variable (depends on x)
  • x: the independent variable

.fit() | .summary() | .params | .predict()

import statsmodels.api as sm
data = pd.read_csv('train.csv')

x = data['x'].tolist()
y = data['y'].tolist()

x = sm.add_constant(x)
result = sm.OLS(y, x).fit()
print(result.summary())
10
Q

Multicollinearity

A

Occurs when two or more independent variables in a multiple regression model are highly correlated with one another. When some features are highly correlated, we may have difficulty distinguishing their individual effects on the dependent variable.
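A common way to quantify it is the variance inflation factor (VIF); a minimal sketch with statsmodels, using hypothetical columns where 'b' nearly duplicates 'a':

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: 'b' is almost a linear function of 'a'
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [2, 4, 6, 8, 11],
                   'c': [5, 3, 8, 1, 2]})
X = sm.add_constant(df)  # VIF is computed on the design matrix with an intercept

# A VIF above roughly 5-10 is commonly read as a sign of multicollinearity
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))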

11
Q

statsmodels.api.qqplot (Quantile-Quantile Plot)

A

The plot compares the quantiles of two distributions (a sample against a theoretical distribution, or two samples) and summarizes whether they are similar in location and shape.

import numpy as np
import statsmodels.api as sm
import pylab as py

data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='45')  # 45-degree reference line
py.show()
12
Q

seaborn.load_dataset(name, cache=True, data_home=None, **kws)

A

Load an example dataset from the online repository (requires internet).
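For example, loading the built-in "tips" dataset:

import seaborn as sns

tips = sns.load_dataset('tips')  # fetched from the seaborn-data GitHub repository
print(tips.head())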

13
Q

pandas_profiling.ProfileReport(df, **kwargs)

A

Generates a basic exploratory report on the input DataFrame.

import pandas as pd
import pandas_profiling as pp

data = pd.DataFrame(dict)  # `dict` here is a column-name -> values mapping defined earlier
print(data)

# Form the ProfileReport and save it as an output.html file
profile = pp.ProfileReport(data)
profile.to_file("output.html")
14
Q

statsmodels.api.Logit()

A

Builds the model and fits the data.

Function for performing logistic regression. The Logit() function accepts y and X as parameters and returns the Logit object. Logistic regression is the type of regression analysis used to find the probability of a certain event occurring.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('logit_train1.csv', index_col = 0)
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
log_reg = sm.Logit(ytrain, Xtrain).fit()
print(log_reg.summary())
15
Q

class.__dict__

A

Contains all the attributes of the class: a dictionary or other mapping object used to store an object's (writable) attributes.

class Shape(object):
    def __init__(self, **kwargs):
        self.__dict__.update(**kwargs)

class Circle(Shape):
    def __init__(self, **kwargs):
        super(Circle, self).__init__(**kwargs)
16
Q

pandas.DataFrame.pivot(index=None, columns=None, values=None)

A

Return reshaped DataFrame organized by given index/column values. In short, it reshapes long-format data into a wide table keyed by the chosen index and columns.

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6
df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
          baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t
17
Q

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

A

Used to separate the array of elements into different bins. The cut function is mainly used to perform statistical analysis on scalar data.

pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
df = pd.DataFrame({'number': np.random.randint(1, 100, 10)})
df['bins'] = pd.cut(x=df['number'], bins=[1, 20, 40, 60, 80, 100])
18
Q

numpy.ones_like(a, dtype=None, order='K', subok=True, shape=None)

A

Return an array of ones with the same shape and type as a given array.

x = np.array([[0, 1, 2], [3, 4, 5]])

np.ones_like(x)
👉 array([[1, 1, 1], [1, 1, 1]])
y = np.arange(3, dtype=float)
👉 array([0., 1., 2.])

np.ones_like(y)
👉 array([1.,  1.,  1.])
19
Q

pandas.DataFrame.corrwith(other, axis=0, drop=False, method='pearson')

A

Compute pairwise correlation. Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.

df1 = pd.DataFrame({"A":[1, 5, 7, 8], "B":[5, 8, 4, 3], "C":[10, 4, 9, 3]})
df2 = pd.DataFrame({"A":[5, 3, 6, 4], "B":[11, 2, 4, 3], "C":[4, 3, 8, 5]})

#To find the correlation among the columns of df1 and df2 along the column axis
df1.corrwith(df2, axis = 0)
20
Q

pandas.DataFrame.corr(method='pearson', min_periods=1)

A

The method finds the correlation of each column in a DataFrame.

data = {
  "Duration": [50, 40, 45],
  "Pulse": [109, 117, 110],
  "Calories": [409.1, 479.5, 340.8]  
}
df = pd.DataFrame(data)
print(df.corr())

          Duration     Pulse  Calories
Duration  1.000000 -0.917663 -0.507551
Pulse    -0.917663  1.000000  0.808134
Calories -0.507551  0.808134  1.000000
21
Q

Classification

A

Predicting a discrete category: will it be Cold or Hot tomorrow?

22
Q

Regression

A

Predicting a continuous value: what is the temperature going to be tomorrow?

23
Q

pandas.DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

A

Drop specified labels from rows or columns. Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names.

  • inplace - if True, modifies the DataFrame in place instead of returning a copy
X = data.drop(columns=['SalePrice'])
y = data['SalePrice']

data.drop(columns='WallMat', inplace=True) # Drop WallMat column
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
24
Q

sklearn.datasets.make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

A

Generate a random regression problem. The input set can either be well conditioned (by default) or have a low rank-fat tail singular profile.

from sklearn.datasets import make_regression

x, y = make_regression(n_samples = 100, n_features = 1, n_informative = 1, noise = 10, random_state = 42)
25
Q

Error analysis

A

An iterative process for identifying common themes within our model’s mistakes.

  • Do specific cohorts within our X data perform better or worse than others?
  • Does a class consistently perform better or worse than another?
  • Are some errors so large they drag overall performance down?
  • These questions can lead you to more data collection or enhanced feature engineering (see the sketch below).
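A minimal sketch of cohort-level error analysis with pandas, using hypothetical y_true/y_pred values and a hypothetical 'cohort' column:

import pandas as pd

results = pd.DataFrame({'cohort': ['a', 'a', 'b', 'b'],
                        'y_true': [10.0, 12.0, 9.0, 20.0],
                        'y_pred': [11.0, 12.5, 15.0, 19.0]})
results['abs_error'] = (results['y_true'] - results['y_pred']).abs()

# Mean error per cohort reveals under-performing segments ('b' here)
print(results.groupby('cohort')['abs_error'].mean())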
26
Q

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

A

Compute a simple cross tabulation (a quantitative analysis of the relationship between multiple variables in a table) of two (or more) factors.

By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.

y_test = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1] # actual truths
preds = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1] # predictions

results_df = pd.DataFrame({"actual": y_test, "predicted": preds}) 
confusion_matrix = pd.crosstab(index= results_df['actual'], columns = results_df['predicted'])

predicted  0  1
actual
0          3  2
1          1  4
27
Q

sklearn.dummy.DummyRegressor(*, strategy='mean', constant=None, quantile=None)

A

Gives predictions based on simple strategies without paying any attention to the input Data.

from sklearn.dummy import DummyRegressor

baseline_model = DummyRegressor(strategy="mean")  # Baseline
baseline_model.fit(X_train, y_train)  # Calculate value for strategy
baseline_model.score(X_test, y_test)
28
Q

Evaluation metrics

A

Used to measure how well a machine learning model can perform a task.
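For instance, scikit-learn ships standard metrics for both task types:

from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: fraction of correct predictions
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75

# Regression: average squared distance from the truth
print(mean_squared_error([2.5, 0.0, 2.0], [3.0, 0.5, 2.0]))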

29
Q

Feature Selection

A

Process of eliminating non-informative features.

Feature Correlation (univariate)
Feature Permutation (multivariate)

Why feature selection?
- Garbage in, garbage out
- The curse of dimensionality
- Reducing complexity

corr = data.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap= "YlGnBu");
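The heatmap above covers the univariate (correlation) route; for feature permutation, a minimal sketch with scikit-learn, assuming an already fitted model and held-out X_test/y_test:

from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and measure how much the score drops
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)  # bigger drop = more informative feature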


30
Q

Feature creation

A

Introduces our domain knowledge into a dataset to provide more signal for our models to learn from.

Why create new features?
- Create additional information
- Potentially improve model performance

Examples of creating new features
- bedroom to total_room ratio
- weight divided by height squared for Body Mass Index
- delivered_date - dispatch_date for the lag time between events
- Categorize the date as either weekday or weekend
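A sketch of the examples above in pandas (hypothetical column names and values):

import pandas as pd

df = pd.DataFrame({'bedrooms': [3], 'total_rooms': [8],
                   'weight': [70.0], 'height': [1.75],
                   'dispatch_date': pd.to_datetime(['2023-01-02']),
                   'delivered_date': pd.to_datetime(['2023-01-05'])})

df['bedroom_ratio'] = df['bedrooms'] / df['total_rooms']
df['bmi'] = df['weight'] / df['height'] ** 2                    # kg / m^2
df['lag_days'] = (df['delivered_date'] - df['dispatch_date']).dt.days
df['is_weekend'] = df['dispatch_date'].dt.dayofweek >= 5        # Sat/Sun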

31
Q

Discretizing

A

Process of turning continuous data into discrete data using bins. It can turn a regression task into a classification task and serve as a form of feature engineering.

Example: turn the dataset into a classification task, Cheap or Expensive, according to the mean.

data['SalePriceBinary'] = pd.cut(x = data['SalePrice'],
                   bins=[data['SalePrice'].min()-1,
                   data['SalePrice'].mean(),
                   data['SalePrice'].max()+1], 
                   labels=['cheap', 'expensive'])
32
Q

Encoding

A

Consists of transforming non-numerical data into an equivalent numerical form.

Why encoding?
- Data may be represented as words, letters, or symbols
- Most Machine Learning algorithms only process numerical data
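A minimal sketch of one-hot encoding with pandas:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})
print(pd.get_dummies(df, columns=['color']))  # one indicator column per category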

33
Q

Dataset balancing

A

Balancing generates higher-accuracy models, higher balanced accuracy, and a balanced detection rate. In a classification dataset, the number of data points representing each class is often unequal.

Why balancing?
- ML algorithms learn by example
- Will tend to predict the under-represented class poorly
- ~30:70 split for binary classification would be considered imbalanced

Balancing strategies
- Over-sampling of minority class
- Under-sampling of the majority class
- Computation of new minority class instances
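A sketch of naive over-sampling with plain pandas (hypothetical toy data; libraries like imbalanced-learn offer more sophisticated strategies):

import pandas as pd

df = pd.DataFrame({'x': range(10), 'label': [0]*7 + [1]*3})  # 70:30 imbalance
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Sample the minority class with replacement until the classes match
balanced = pd.concat([majority,
                      minority.sample(len(majority), replace=True, random_state=42)])
print(balanced['label'].value_counts())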

34
Q

pandas.DataFrame.replace(to_replace=None, value=NoDefault.no_default, inplace=False, limit=None, regex=False, method=NoDefault.no_default)

A

Replace values given in to_replace with value. Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

data.Alley.replace(np.nan, "NoAlley", inplace=True) #Replace NaN by "NoAlley"
s = pd.Series([1, 2, 3, 4, 5])
s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
df.replace(to_replace="Boston Celtics", value="Omega Warrior")
# Replace missing Pesos values with the column mean
data.Pesos.replace(np.nan, data.Pesos.mean())
35
Q

Feature Scaling

A

Transforming continuous features into a common, smaller range.

Why scaling?
- Features with large magnitudes can incorrectly outweigh features of small magnitudes
- Scaling to smaller magnitudes improves computational efficiency
- Increases interpretability of feature coefficients
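A minimal sketch with scikit-learn's MinMaxScaler (StandardScaler is used the same way):

from sklearn.preprocessing import MinMaxScaler

X = [[1.0, 200.0], [2.0, 400.0], [3.0, 1000.0]]  # features on very different scales
scaler = MinMaxScaler()                          # rescales each feature to [0, 1]
print(scaler.fit_transform(X))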

36
Q

Outliers

A

Data points that deviate from the rest of the data.

Common reasons for outliers
- Data entry errors
- Measurement errors
- Data manipulation and preprocessing errors
- Novelties (not errors)

Outliers affect:
- Dataset distributions and patterns
- Central tendency metrics e.g. mean and standard deviation
- Machine learning models’ performances

Handling Outliers
- Is the outlier evidently false?
- Could it be a novelty?
- Could it be used as a feature?

See Outliers
data[['GrLivArea']].boxplot()
Dropping Outliers

false_observation = data['GrLivArea'].argmin() # Get index corresponding to minimum value
data = data.drop(false_observation).reset_index(drop=True) # Drop row
37
Q

pandas.isnull()

A

Detect missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are missing.

(data.WallMat.isnull().sum()/len(data))*100 # Percentage of missing values
data.isnull().sum().sort_values(ascending=False)
pd.isna('dog')
👉 False
pd.isna(pd.NA)
👉 True
38
Q

pandas.DataFrame.duplicated(subset=None, keep='first')

A

Return boolean Series denoting duplicate rows. Considering certain columns is optional.

df = pd.DataFrame({
       'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
       'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
       'rating': [4, 4, 3.5, 15, 5]
})

df.duplicated()
0    False
1    True
2    False
3    False
4    False
39
Q

pandas.Series.argmin and pandas.Series.argmax(axis=None, skipna=True, *args, **kwargs)

A

Return the int position of the smallest (argmin) or largest (argmax) value in the Series. If the minimum (or maximum) is achieved in multiple locations, the first row position is returned.

s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
               'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})

Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64

s.argmax()
👉 2

s.argmin()
👉 0
40
Q

False Negative

A

Incorrectly labeling a sample as negative when it is actually positive.

👉 Predicted ❌
👉 Actual ✅

41
Q

False Positive

A

Incorrectly labeling a sample as positive when it is actually negative.

👉 Predicted ✅
👉 Actual ❌

42
Q

True Negative

A

Correctly identifying a negative sample.

👉 Predicted ❌
👉 Actual ❌

43
Q

True Positive

A

Correctly identifying a positive sample.

👉 Predicted ✅
👉 Actual ✅

44
Q

seaborn.heatmap(data, *, vmin=None, vmax=None, cmap=None, center=None, robust=False, annot=None, fmt='.2g', annot_kws=None, linewidths=0, linecolor='white', cbar=True, cbar_kws=None, cbar_ax=None, square=False, xticklabels='auto', yticklabels='auto', mask=None, ax=None, **kwargs)

A

Plot rectangular data as a color-encoded matrix.

sns.heatmap(pd.DataFrame(X).corr(), cmap='coolwarm')
plt.figure(figsize=(10,7))
mask = np.triu(np.ones_like(orders.corr(), dtype=bool))
sns.heatmap(orders.corr(), annot=True, mask=mask)  # mask hides the redundant upper triangle
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data)
45
Q

pandas.DataFrame.select_dtypes(include=None, exclude=None)

A

Return a subset of the DataFrame’s columns based on the column types.

df.select_dtypes(include=['float64'])
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
df.select_dtypes(exclude=['int64'])
       b       c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
46
Q

pandas.Index(data=None, dtype=None, copy=False, name=None, tupleize_cols=True, **kwargs)

A

The immutable sequence is used for indexing and alignment. The basic object storing axis labels for all pandas objects.

pd.Index([1, 2, 3])
Int64Index([1, 2, 3], dtype='int64')
pd.Index(list('abc'))
Index(['a', 'b', 'c'], dtype='object')
47
Q

sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

A

Randomized search on hyperparameters. RandomizedSearchCV is essentially the same as GridSearchCV, but it replaces GridSearchCV's exhaustive grid search with random sampling from the parameter space.

Use it when the number of parameters to consider is especially large and their effects are of unbalanced magnitude.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, None]}  # example search space
rnd_search = RandomizedSearchCV(RandomForestClassifier(), param, n_iter=10, cv=9)
rnd_search.fit(X, y)
rnd_search.best_params_
48
Q

pandas.DataFrame.hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)

A

Make a histogram of the DataFrame’s columns. A histogram is a representation of the distribution of data.

data[['GrLivArea']].plot.hist(bins=20)
df = pd.DataFrame({
     'length': [1.5, 0.5, 1.2, 0.9, 3],
     'width': [0.7, 0.2, 0.15, 0.2, 1.1]
     }, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
hist = df.hist(bins=3)
49
Q

threshold

A

One of the most basic techniques for segmentation: pixel values are compared against a cutoff, yielding segments that each represent something. For example, more complex segmentation algorithms might be able to segment out "house-like" structures in an image.
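A minimal sketch of threshold-based segmentation with NumPy, assuming a grayscale image stored as a 2-D array of pixel intensities:

import numpy as np

image = np.array([[10, 200, 30],
                  [220, 40, 250]])  # hypothetical pixel values in 0-255
mask = image > 128                  # True where a pixel passes the cutoff
print(mask.astype(int))             # binary segments: foreground vs. background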