Programming with Python for Data Science Flashcards Preview


Flashcards in Programming with Python for Data Science Deck (56)
1
Q

What does the Data Analysis process consist of?

A
  • Collecting Data from various sources
  • Wrangling Data to make it more reliable
  • Exploring Data using statistics and visualizations
  • Transforming Data to prepare it for modeling
  • Modeling Data using the right machine learning algorithms
  • Evaluating the results of the data models

https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@Course_Map.pdf

2
Q

What must you do before collecting data?

A

Before you even start doing that, you should have a question in mind to drive your data collection process.

3
Q

Good to Remember:

A

Data might be collected from a variety of sources in the physical world, such as thermostats, smart cars, satellite transmissions, cameras, logs, and the Internet. As a data scientist, you’ll usually be given data by your clients and supervisors. But if you work on your own passion projects, it’ll be your own responsibility to amass data for analysis.

4
Q

Which statement makes the most sense about data analysis?

A

Special care ought to be dedicated to collecting data, so that you have enough to do effective analysis.

5
Q

If you wanted to engage in the data analysis process, the best place to start is by…

A

Having a question in mind to drive your data collection process

6
Q

What is Machine Learning?

A

Machine Learning is the name given to generalizable algorithms that enable a computer to carry out a task by examining data rather than hard programming.

With machine learning, results are attributed more to your data than to the algorithm, since it’s the data that instructs the computer what to do. In other words, the algorithm or task is generic; but the data is specific to the problem being solved. You might feed a specific machine learning algorithm data about food your wife likes and food she doesn’t like, and the task the algorithm accomplishes is learning how to differentiate between the two. On the other hand, she might feed the same algorithm data about clothes you like / dislike. Without altering a single line of code, the same programmed algorithm can solve the new task based solely on the data!

7
Q

Another explanation of Unsupervised Learning?

A

Given a lot of data, the computer hasn’t the slightest idea what any of it means. Yet, it’s still able to figure out if there are any meaningful groupings and patterns within the data, along with instances where that data seems out of place!

8
Q

Another explanation of Supervised Learning?

A

Unlike unsupervised learning where the computer had no idea what any of the data meant, with supervised learning, the computer is in charge of taking your data and then fitting rules and equations to it. Once it learns these generalized rules, the rules can be applied to data the computer has never seen before.

9
Q

Basic criteria to apply machine learning or not?

A

Not every problem is machine solvable, nor should every problem even be approached with machine learning! If your issue is directly solvable through some simple means, such as a few yes / no decisions, or if it does not require examining loads of data, there is probably a more fitting solution for you than machine learning.

Starting with your expertise in an area, look for interesting problems to tackle. Break those problems down into smaller constituents, so that they’re either entirely solvable with machine learning, or at least partly so.

10
Q

What is the difference between supervised and unsupervised learning?

A

Unsupervised learning attempts to extract patterns; supervised learning tries to fit rules and equations to your data

11
Q

Good to know Tidbit:

A

TensorFlow is a critical part of Google’s bread-and-butter search pipeline. Google open sourced this tool without much fear because it’s really the data and not the algorithm that drives their killer insights. With enough good data—and we all know Google knows us better than our parents, siblings, and even spouses—it is simply amazing what you can get a computer to do.

12
Q

How do you decide on algorithms?

A

Given good data, there are a few targeted areas where machine learning really shines. If you can engineer your data-driven questions into one or more of these identified areas, you can take full advantage of all machine learning has to offer using out-of-the-box algorithms.

13
Q

What is Classification and how does it work?

A

The goal of classification is to find what class a sample belongs to.

Classification falls into the realm of supervised learning because in order for it to work, you have to guide the computer by providing it with examples of correctly labeled records. Once you’re done training the computer, you can test it by seeing how accurately it scores those records.

A class could be something like Windows 10 Mobile, and a sample could be something like a phone. To get classification working, you have to feed the machine learning algorithm a decent amount of phone examples, some of them labeled Windows 10 Mobile, and others labeled, well… non-Windows 10 Mobile. With enough training samples, a classifier will eventually be able to generalize what similarities constitute a Windows 10 Mobile phone and voilà, you’ve trained a computer to figure out phone types!
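As a sketch of this train-then-test workflow (the phone data below is entirely made up for illustration), a scikit-learn classifier can be fit on labeled samples and then asked to label a sample it has never seen:

```python
# Hypothetical phone data: each sample has two made-up features
# (screen size in inches, price in dollars) and a known label.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[5.0, 200], [5.2, 250], [4.0, 600], [4.7, 650]]
y_train = ['Windows 10 Mobile', 'Windows 10 Mobile', 'Other', 'Other']

# Train ("fit") the classifier on the correctly labeled records
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Ask it to classify a phone it has never seen before
prediction = model.predict([[5.1, 220]])
```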

14
Q

What is Regression?

A

The goal of regression is to predict a continuous-valued feature associated with a sample. Continuous-valued means that small changes in the input result in small changes in the output.

With regression, a mathematical relationship is modeled for your samples so that as you gently alter one feature, another feature responds by being altered as well.

Regression falls into the realm of supervised learning because in order for it to work, you have to provide the computer with labeled samples. It then attempts to fit an equation to the samples’ features.
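A minimal sketch with scikit-learn’s LinearRegression (the numbers are made up): the model fits an equation to labeled samples, then predicts a continuous value for a new input:

```python
from sklearn.linear_model import LinearRegression

# Made-up samples that follow y = 2x exactly
X = [[1], [2], [3], [4]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression()
model.fit(X, y)  # fit an equation to the samples' features

# Small change in input -> small change in output
prediction = model.predict([[5]])
```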

15
Q

What is Clustering?

A

The goal of clustering is to automatically group similar samples into sets.

Since a clustering algorithm has no prior knowledge of how the sets should be defined, and furthermore, since the clustering process is unsupervised, the clustering algorithm needs to have a way to tell which samples are the most similar, so it can group them. It does this the same way we humans do: by looking at the various characteristics and features of the sample.

There are different types of clustering algorithms: some supervised, some unsupervised, and even some semi-supervised.
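A minimal sketch with scikit-learn’s KMeans, using two made-up blobs of points: the algorithm receives no labels at all, yet still groups similar samples together by their features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious, made-up groups of samples; note there are no labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # group assignments inferred from features alone
```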

16
Q

What is Dimensionality Reduction (Unsupervised)?

A

The goal of dimensionality reduction is to systematically and intelligently reduce the number of features considered in a dataset. Stated differently, trim the fat off. Oftentimes, in one’s eagerness to collect enough data for machine learning to be effective, you might add irrelevant features to your dataset. Bad features have the effect of hindering the machine learning process, and make your data harder to understand. Dimensionality reduction attempts to trim your dataset down to the bare essentials needed for decision-making.

Dimensionality reduction falls into the realm of unsupervised learning because you don’t instruct the computer which features you want it to build; the computer infers this information automatically by examining your unlabeled data.

17
Q

What is Reinforcement Learning?

A

The goal of reinforcement learning is to maximize a cumulative reward function (or equivalently, minimize a cumulative cost function), given a set of actions and results. Reinforcement learning is modeled to mimic the way we learn in the real world. We try to solve problems using different techniques. Most of the time, nothing of merit results from our experiments. But occasionally, we stumble upon a set of actions that result in a sweet reward. When this happens, we attempt to repeat these actions that will result in our getting rewarded. If we are rewarded yet again, we further associate those actions with the reward and that is known as the reinforcement cycle. The entire process is also known as performance maximization.

Reinforcement learning is actually a completely different category of learning from supervised and unsupervised learning. It’s closer to supervised learning than it is to unsupervised learning, but you could get away with calling it semi-supervised learning.

18
Q

For data to be usable by SciKit-Learn, how should it be organized?

A

To be usable by SciKit-Learn, the machine learning library for Python, your data needs to be organized into a matrix of samples and features.

19
Q

What are Features in a dataset?

A

Features are those quantitative traits that describe your samples. They might be numeric or textual, for example, CompanyName is a textual feature in your ‘companies’ dataset. Different samples might result in values such as ‘Microsoft’, ‘EdX’, or ‘Coding Dojo’ for the CompanyName feature. If CompanyRating were a numeric feature in the same dataset, its value might be a score between 1 and 5 for each company.

20
Q

What are the different kinds of Features?

A
  1. Continuous
  2. Categorical (usually text)
    • Ordinal (ordered)
    • Nominal (no order)
21
Q

What is a good rule to remember while collecting data initially?

A

Gather as many samples and features as possible. Do not throw away or delete any samples or features until initial data analysis.

Sometimes two or more weak features may combine to give you a powerful effect.

22
Q

Good to Remember : Data Collection and Features

A

One of the beauties of machine learning is its ability to discover relationships within your data you might be oblivious to. Two or more seemingly weak features, when combined, might end up being exactly that golden feature you’ve been searching for.

23
Q

Good to Remember as a Data Scientist

A

Your machine learning goal should be to train your algorithms instead of hard coding them.

Think of your machine learning models as if they were small children who have absolutely no knowledge except what you train them with; what information would they need to know to make the right decisions?

24
Q

When building out your dataset initially, what are the three things should you focus on the most?

A
  • Collecting more samples than features, so that the mathematics required for machine learning works out well.
  • Collecting features, even if independently they don’t do a great job at answering your dataset’s question.
  • Letting your intuition about the question, and your expertise in the domain of your issue, drive you toward choosing the right features.
25
Q

What are the two data structures in Pandas?

A
  1. The first is the series object, a one-dimensional labeled array that represents a single column in your dataset.
  2. The second structure you need to work with is a collection of series called a dataframe. To manipulate a dataset, you first need to load it into a dataframe. Different people prefer alternative methods of storing their data, so Pandas tries to make loading data easy no matter how it’s stored.
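A quick sketch of both structures, reusing the made-up ‘companies’ data from elsewhere in this deck:

```python
import pandas as pd

# A series: a one-dimensional labeled array (a single column)
ratings = pd.Series([5, 4, 4], name='CompanyRating')

# A dataframe: a collection of series (the whole dataset)
df = pd.DataFrame({
    'CompanyName': ['Microsoft', 'EdX', 'Coding Dojo'],
    'CompanyRating': [5, 4, 4],
})
```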
26
Q

What is indexing in Pandas?

A

Selecting a subset of your data from a series or a dataframe.

27
Q

How do you import pandas library?

A

import pandas as pd

28
Q

What is the difference between df.colA and df[['colA']]?

Important to Know

A

One returns a series, the other returns a dataframe.
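A tiny demonstration, with a made-up dataframe, of which access style returns which type:

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2], 'colB': [3, 4]})

as_series = df.colA       # attribute-style access returns a Series
as_frame = df[['colA']]   # double brackets return a DataFrame
```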

29
Q

>>> import pandas as pd

>>> animals = pd.read_csv('animals.csv', sep='\t')

>>> animals.columns

Given the above fictional dataset, which command would you use to select the family, species, and population columns of the first five rows?

A

animals.loc[0:4, ['family', 'species', 'population']]

30
Q

What is an important difference between the:

  • .loc[]
  • .ix[]
  • .iloc[] methods
A

The important difference is that .loc[] and .ix[] are inclusive of the range of values selected, whereas .iloc[] is non-inclusive. In that sense, df.loc[0:1, :] would select the first two rows, but only the first row would be returned using df.iloc[0:1, :]. (Note that .ix[] has since been deprecated and removed from Pandas.)
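This inclusive / non-inclusive difference can be checked directly on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

by_loc = df.loc[0:1, :]    # label-based and inclusive: rows 0 and 1
by_iloc = df.iloc[0:1, :]  # position-based and non-inclusive: row 0 only
```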

31
Q

What does the df.describe() method return?

A

Generate various summary statistics, excluding NaN values.

32
Q

features = ["The enchanted forest beamed with magic once the prince was born.", "Jinto's life changed forever when his planet surrendered without firing a single shot."]

Which technique would you use to encode the above features?

A

Encode using count vectorizer.
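A sketch of how those sentences could be encoded with scikit-learn’s CountVectorizer, which turns each sentence into a row of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

features = [
    "The enchanted forest beamed with magic once the prince was born.",
    "Jinto's life changed forever when his planet surrendered without firing a single shot.",
]

cv = CountVectorizer()
# Sparse matrix: one row per sentence, one column per vocabulary word
X = cv.fit_transform(features)
```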

33
Q

Ordinal and Nominal features should be encoded using:

A

.astype('category'), using an ordered categorical dtype for ordinal features.
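A sketch of ordinal encoding on a made-up satisfaction feature; the ordered categorical dtype spelling below is the modern Pandas form:

```python
import pandas as pd

df = pd.DataFrame({'satisfaction': ['Low', 'High', 'Medium', 'Low']})

# Ordinal feature: the categories have a meaningful order
ordered = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
codes = df.satisfaction.astype(ordered).cat.codes  # Low=0, Medium=1, High=2
```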

34
Q

How is missing data represented in Pandas?

A

Pandas represents missing data internally using Numpy’s np.nan

35
Q

What problems can missing data create?

A

If not accounted for, missing data might lead you to erroneous conclusions about your samples by resulting in incorrect sums and means, and even by skewing distributions.

36
Q

What are some ways of dealing with nan values in a dataset?

A

Any time a nan is encountered, replace it with a scalar value:

df.my_feature.fillna( df.my_feature.mean() )
df.fillna(0)

When a nan is encountered, replace it with the immediate, previous, non-nan value:

df.fillna(method='ffill')  # fill the values forward
df.fillna(method='bfill')  # fill the values in reverse
df.fillna(method='ffill', limit=5)  # cap the number of consecutive fills

Fill out nans by interpolating over them with the non-nan values that come immediately before and after. You can select the interpolation method you’d like to use, such as nearest, cubic, spline and more:

df.interpolate(method='polynomial', order=2)

37
Q

.convert_objects(convert_numeric=True) What does this method do?

A

This method converts a dataframe’s object columns into numeric values where possible. (It has since been deprecated in favor of pd.to_numeric.)

38
Q

What does the .unique() method do?

A

It returns an array of the distinct values in a dataframe column. (Use .nunique() if you want the count of unique values.)
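A quick check on a made-up column showing .unique() versus .nunique():

```python
import pandas as pd

df = pd.DataFrame({'wheat_type': ['kama', 'rosa', 'kama', 'canadian']})

values = df.wheat_type.unique()   # the distinct values themselves
count = df.wheat_type.nunique()   # how many distinct values there are
```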

39
Q

Plot a parallel coordinates chart, grouped by the wheat_type feature. Be sure to set the optional display parameter alpha to 0.4

A

from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

parallel_coordinates(df, 'wheat_type', alpha=0.4)
plt.show()

40
Q

Plot an Andrews curves chart, grouped by the wheat_type feature.

A

from pandas.plotting import andrews_curves
import matplotlib.pyplot as plt

andrews_curves(df, 'wheat_type')
plt.show()

41
Q

What is PCA?

A

Principal Component Analysis.

PCA falls into the group of dimensionality reduction algorithms.

42
Q

What are some of the weaknesses of PCA?

A
  1. The first is that it is sensitive to the scaling of your features.
  2. Although PCA is generally fast, very large datasets might still take a while to train.
  3. The last issue to keep in mind is that PCA is a linear transformation only!
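Because of the first weakness, a common pattern is to standardize your features before running PCA; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up samples whose two features live on very different scales
X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 130.0], [4.0, 120.0]])

# Standardize first so no feature dominates purely because of its units
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)  # 2 features reduced to 1 component
```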
43
Q

Please complete the sentence so that it makes the most sense:

Principal component analysis…

A

Ensures each newly computed feature is orthogonal to all previously computed ones, minimizing overlaps.

Explanation

Unsupervised learning methods do not require or examine your labels / classifications. In fact, you should remove them from your dataset when you perform PCA on it.

PCA also is a linear transformation.

The last option was tricky; if you have more features than samples, PCA will only be able to create as many components as the number of samples you have. It is more ideal to have more samples than features, so that your hard limit on how many components you can reduce to is not based on how much data you’ve collected.

44
Q

Which of these statements is problematic?

A

Since PCA is sensitive to feature scaling, if you have a feature that is a linear transformation of the other, e.g. feature2 = 10 * feature1, then both features will be ignored

FALSE

Explanation

PCA is sensitive to feature scaling, but having two 100% correlated features won’t result in them both being ignored. Rather PCA will recognize that they both measure in the same ‘direction’, perhaps giving them more weight if anything.

45
Q

Good to Remember:

A

Whenever you’re given a dataset, the first thing you should do is find out as much about it as possible, both by reading up on any metadata, as well as by prodding through the actual data.

46
Q

Isomap is most beneficial…

A

When a non-linear, geometric structure is expressed in your data

Explanation

If your data doesn’t have an embedding, isomap might still be able to find some interesting information for you, but it works best when there actually is a non-linear pattern embedded in your higher dimensionality data.

If you don’t know how many samples are needed to capture that pattern and fail to collect enough, isomap won’t be as effective as if you had collected more samples than were needed.

If your data actually has a linear relationship, then using PCA will get you where you want to be more efficiently than isomap.

47
Q

Which of the following explanations of isomap is true?

A

A one sentence summary of isomap’s implementation is that at its core, it is essentially a node distance map that has been fed into a special type of PCA.

Explanation

One of isomap’s greatest weaknesses is that noisy data might short-circuit the actual geodesic path. In such cases, isomap will prefer the noisy path to the actual path and produce an incorrectly warped mapping.

Isomap is slower than PCA because isomap essentially implements a multi-dimensional scaling (similar to PCA) through projection; but in addition to that, it also has to calculate the nearest neighbor map.

As mentioned in the lecture, even if the distance metric isn’t 100% accurate, isomap can still function reasonably. Particularly for distant nodes, they won’t be included in the *nearest* neighbors list to start with.
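A minimal sketch with scikit-learn’s Isomap on made-up data: a 1-D curve embedded non-linearly in 2-D, unrolled back down to a single dimension:

```python
import numpy as np
from sklearn.manifold import Isomap

# Made-up data: a sine curve (a non-linear 1-D structure) embedded in 2-D,
# with a little noise added
rng = np.random.RandomState(0)
t = np.linspace(0, 3 * np.pi, 100)
X = np.column_stack([np.sin(t), t]) + rng.normal(scale=0.01, size=(100, 2))

# Build the nearest-neighbor map, then reduce to one dimension
iso = Isomap(n_neighbors=5, n_components=1)
X_1d = iso.fit_transform(X)
```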

48
Q

In data wrangling, how do you treat irrelevant, incomplete, and missing data?

A

In data wrangling, irrelevant, incomplete, and missing data is either defaulted to a specific value or removed entirely.

NaNs are stripped out, typographical errors are patched, and perhaps even some data normalization occurs

49
Q

What is the difference between K Means and K Nearest Neighbor classifiers?

A

K-Means is an unsupervised clustering algorithm, whereas K-Neighbors is a supervised classification algorithm. If clustering is the process of separating your samples into groups, then classification would be the process of assigning samples into those groups. Given a set of groups, take a set of samples and mark each sample as being a member of a group. Each group is the correct answer, label, or classification of the sample.
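A sketch of the two working together on made-up points: K-Means invents the groups without any labels, then K-Neighbors learns to assign new samples to those groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Unsupervised: K-Means separates the samples into groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: K-Neighbors classifies new samples into those groups
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
new_label = knn.predict([[4.8, 5.2]])
```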

50
Q

What is a model?

A

A model is any formula or algorithm designed to represent the mechanics of your data.

51
Q

What are estimators in scikit-learn?

A

SciKit-Learn’s documentation refers to all of the machine learning methods in its library as estimators.

52
Q

What are predictors in scikit?

A

Lastly, when dealing with data that come with attributes you want to learn to predict, such as supervised learning problems, the estimators designed to handle this are called predictors.

53
Q

Confusion Matrix plotting….good to remember

A

Traditionally, the predicted targets are aligned on the X-axis of the matrix, and the true values are aligned on the Y-axis
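With scikit-learn’s confusion_matrix, true values index the rows and predicted values index the columns, matching that plotting convention (the labels here are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # the actual labels
y_pred = [0, 1, 1, 1, 0]  # what the model predicted

# Row i, column j counts samples with true label i predicted as label j
cm = confusion_matrix(y_true, y_pred)
```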

54
Q

What are SVMs?

A

Support vector machines are a set of supervised learning algorithms that you can use for classification, regression and outlier detection purposes.

55
Q

Good to Remember about the working of an SVM?

A

A support vector classifier behaves just like your stick placement. The two things it wishes to fulfill, in order of priority, are: first, finding a way to separate your data. It does this by looking at the balls from either class that are closest to one another, or in machine learning lingo, the support vectors. Second, the algorithm ensures it separates your data in the best way possible by orienting the boundary such that the gap, or margin, between it and your support vector samples is maximized.
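A minimal sketch with scikit-learn’s linear SVC on made-up points: after fitting, the support vectors (the samples closest to the boundary) are exposed directly:

```python
from sklearn.svm import SVC

# Two made-up, linearly separable classes of "balls"
X = [[0, 0], [0, 1], [3, 3], [3, 4]]
y = ['red', 'red', 'blue', 'blue']

svc = SVC(kernel='linear')
svc.fit(X, y)

# The samples closest to the maximum-margin boundary
support_points = svc.support_vectors_
```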
