Programming with Python for Data Science Flashcards Preview


Flashcards in Programming with Python for Data Science Deck (56)
1
Q

What does the Data Analysis process consist of?

A
  • Collecting Data from various sources
  • Wrangling Data to make it more reliable
  • Exploring Data using statistics and visualizations
  • Transforming Data to prepare it for modeling
  • Modeling Data using the right machine learning algorithms
  • Evaluating the results of the data models

https://courses.edx.org/asset-v1:Microsoft+DAT210x+4T2016+type@asset+block@Course_Map.pdf

2
Q

What must you do before collecting data?

A

Before you even start doing that, you should have a question in mind to drive your data collection process.

3
Q

Good to Remember:

A

Data might be collected from a variety of sources in the physical world, such as thermostats, smart cars, satellite transmissions, cameras, logs, and the Internet. As a data scientist, you’ll usually be given data by your clients and supervisors. But if you work on your own passion projects, it’ll be your own responsibility to amass data for analysis.

4
Q

Which statement makes the most sense about data analysis?

A

Special care ought to be dedicated to collecting data, so that you have enough to do effective analysis.

5
Q

If you wanted to engage in the data analysis process, the best place to start is by…

A

Having a question in mind to drive your data collection process

6
Q

What is Machine Learning?

A

Machine Learning is the name given to generalizable algorithms that enable a computer to carry out a task by examining data rather than hard programming.

With machine learning, results are attributed more to your data than to the algorithm, since it’s the data that instructs the computer what to do. In other words, the algorithm or task is generic; but the data is specific to the problem being solved. You might feed a specific machine learning algorithm data about food your wife likes and food she doesn’t like, and the task the algorithm accomplishes is learning how to differentiate between the two. On the other hand, she might feed the same algorithm data about clothes you like / dislike. Without altering a single line of code, the same programmed algorithm can solve the new task based solely on the data!

7
Q

Another explanation of Unsupervised Learning?

A

Given a lot of data, the computer hasn’t the slightest idea what any of it means. Yet, it’s still able to figure out if there are any meaningful groupings and patterns within the data, along with instances where that data seems out of place!

8
Q

Another explanation of Supervised Learning?

A

Unlike unsupervised learning where the computer had no idea what any of the data meant, with supervised learning, the computer is in charge of taking your data and then fitting rules and equations to it. Once it learns these generalized rules, the rules can be applied to data the computer has never seen before.

9
Q

Basic criteria to apply machine learning or not?

A

Not every problem is machine solvable, nor should every problem even be approached with machine learning! If your issue is directly solvable through some simple means, such as a few yes / no decisions, or if it does not require examining loads of data, there is probably a more fitting solution for you than machine learning.

Starting with your expertise in an area, look for interesting problems to tackle. Break those problems down into smaller constituents, so that they’re either entirely solvable with machine learning, or at least partly so.

10
Q

What is the difference between supervised and unsupervised learning?

A

Unsupervised learning attempts to extract patterns; supervised learning tries to fit rules and equations to your data

11
Q

Good to know Tidbit:

A

TensorFlow is a critical part of Google’s bread-and-butter search pipeline. Google open sourced this tool without much fear because it’s really the data and not the algorithm that drives their killer insights. With enough good data—and we all know Google knows us better than our parents, siblings, and even spouses—it is simply amazing what you can get a computer to do.

12
Q

How do you decide on algorithms?

A

Given good data, there are a few targeted areas where machine learning really shines. If you can engineer your data-driven questions into one or more of these identified areas, you can take full advantage of all machine learning has to offer using out-of-the-box algorithms.

13
Q

What is Classification and how does it work?

A

The goal of classification is to find what class a sample belongs to.

Classification falls into the realm of supervised learning because in order for it to work, you have to guide the computer by providing it with examples of correctly labeled records. Once you’re done training the computer, you can test it by seeing how accurately it scores those records.

A class could be something like Windows 10 Mobile, and a sample could be something like a phone. To get classification working, you have to feed the machine learning algorithm a decent amount of phone examples, some of them labeled Windows 10 Mobile, and others labeled, well… non-Windows 10 Mobile. With enough training samples, a classifier will eventually be able to generalize what similarities constitute a Windows 10 Mobile phone and voilà, you’ve trained a computer to figure out phone types!
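As a sketch of this train-then-test workflow (the phone data below is entirely made up for illustration), a scikit-learn classifier can be fit on labeled samples and then asked to label a sample it has never seen:

```python
# Hypothetical phone data: each sample has two made-up features
# (screen size in inches, price in dollars) and a known label.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[5.0, 200], [5.2, 250], [4.0, 600], [4.7, 650]]
y_train = ['Windows 10 Mobile', 'Windows 10 Mobile', 'Other', 'Other']

# Train ("fit") the classifier on the correctly labeled records
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Ask it to classify a phone it has never seen before
prediction = model.predict([[5.1, 220]])
```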

14
Q

What is Regression?

A

The goal of regression is to predict a continuous-valued feature associated with a sample. Continuous-valued means that small changes in the input result in small changes in the output.

With regression, a mathematical relationship is modeled for your samples so that as you gently alter one feature, another feature responds by being altered as well.

Regression falls into the realm of supervised learning because in order for it to work, you have to provide the computer with labeled samples. It then attempts to fit an equation to the samples’ features.
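A minimal sketch with scikit-learn’s LinearRegression (the numbers are made up): the model fits an equation to labeled samples, then predicts a continuous value for a new input:

```python
from sklearn.linear_model import LinearRegression

# Made-up samples that follow y = 2x exactly
X = [[1], [2], [3], [4]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression()
model.fit(X, y)  # fit an equation to the samples' features

# Small change in input -> small change in output
prediction = model.predict([[5]])
```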

15
Q

What is Clustering?

A

The goal of clustering is to automatically group similar samples into sets.

Since a clustering algorithm has no prior knowledge of how the sets should be defined, and furthermore, since the clustering process is unsupervised, the clustering algorithm needs to have a way to tell which samples are the most similar, so it can group them. It does this the same way we humans do: by looking at the various characteristics and features of the sample.

There are different types of clustering algorithms: some supervised, some unsupervised, and even some semi-supervised.
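A minimal sketch with scikit-learn’s KMeans, using two made-up blobs of points: the algorithm receives no labels at all, yet still groups similar samples together by their features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious, made-up groups of samples; note there are no labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # group assignments inferred from features alone
```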

16
Q

What is Dimensionality Reduction (Unsupervised)?

A

The goal of dimensionality reduction is to systematically and intelligently reduce the number of features considered in a dataset. Stated differently, trim the fat off. Oftentimes, in one’s eagerness to collect enough data for machine learning to be effective, you might add irrelevant features to your dataset. Bad features have the effect of hindering the machine learning process, and make your data harder to understand. Dimensionality reduction attempts to trim your dataset down to the bare essentials needed for decision-making.

Dimensionality reduction falls into the realm of unsupervised learning because you don’t instruct the computer which features you want it to build; the computer infers this information automatically by examining your unlabeled data.

17
Q

What is Reinforcement Learning?

A

The goal of reinforcement learning is to maximize a cumulative reward function (or equivalently, minimize a cumulative cost function), given a set of actions and results. Reinforcement learning is modeled to mimic the way we learn in the real world. We try to solve problems using different techniques. Most of the time, nothing of merit results from our experiments. But occasionally, we stumble upon a set of actions that result in a sweet reward. When this happens, we attempt to repeat these actions that will result in our getting rewarded. If we are rewarded yet again, we further associate those actions with the reward and that is known as the reinforcement cycle. The entire process is also known as performance maximization.

Reinforcement learning is actually a completely different category of learning from supervised and unsupervised learning. It’s closer to supervised learning than it is to unsupervised learning, but you could get away with calling it semi-supervised learning.

18
Q

For data to be usable by SciKit-Learn, how should it be organized?

A

To be usable by SciKit-Learn, the machine learning library for Python, your data needs to be organized into a matrix of samples and features.

19
Q

What are Features in a dataset?

A

Features are those quantitative traits that describe your samples. They might be numeric or textual, for example, CompanyName is a textual feature in your ‘companies’ dataset. Different samples might result in values such as ‘Microsoft’, ‘EdX’, or ‘Coding Dojo’ for the CompanyName feature. If CompanyRating were a numeric feature in the same dataset, its value might be a score between 1 and 5 for each company.

20
Q

What are the different kinds of Features?

A
  1. Continuous
  2. Categorical (usually text)
    • Ordinal (ordered)
    • Nominal (no order)
21
Q

What is a good rule to remember while collecting data initially?

A

Gather as many samples and features as possible. Do not throw away or delete any samples or features until initial data analysis.

Sometimes two or more weak features may combine to give you a powerful effect.

22
Q

Good to Remember : Data Collection and Features

A

One of the beauties of machine learning is its ability to discover relationships within your data you might be oblivious to. Two or more seemingly weak features, when combined, might end up being exactly that golden feature you’ve been searching for.

23
Q

Good to Remember as a Data Scientist

A

Your machine learning goal should be to train your algorithms instead of hard coding them.

Think of your machine learning models as if they were small children who have absolutely no knowledge except what you train them with; what information would they need to know to make the right decisions?

24
Q

When building out your dataset initially, what are the three things should you focus on the most?

A
  • Collecting more samples than features, so that the mathematics required for machine learning works out well.
  • Collecting features, even if independently they don’t do a great job at answering your dataset’s question.
  • Letting your intuition about the question, and your expertise in the domain of your issue, drive you toward choosing the right features.
25
Q

What are the two data structures in Pandas?

A
  1. The first is the series object, a one-dimensional labeled array that represents a single column in your dataset.
  2. The second structure you need to work with is a collection of series called a dataframe. To manipulate a dataset, you first need to load it into a dataframe. Different people prefer alternative methods of storing their data, so Pandas tries to make loading data easy no matter how it’s stored.
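A quick sketch of both structures, reusing the made-up ‘companies’ data from elsewhere in this deck:

```python
import pandas as pd

# A series: a one-dimensional labeled array (a single column)
ratings = pd.Series([5, 4, 4], name='CompanyRating')

# A dataframe: a collection of series (the whole dataset)
df = pd.DataFrame({
    'CompanyName': ['Microsoft', 'EdX', 'Coding Dojo'],
    'CompanyRating': [5, 4, 4],
})
```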
26
Q

What is indexing in Pandas?

A

Selecting a subset of your data from a series or a dataframe.

27
Q

How do you import pandas library?

A

import pandas as pd

28
Q

What is the difference between df.colA and df[['colA']]?

Important to Know

A

One returns a series, the other returns a dataframe.
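A tiny demonstration, with a made-up dataframe, of which access style returns which type:

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2], 'colB': [3, 4]})

as_series = df.colA       # attribute-style access returns a Series
as_frame = df[['colA']]   # double brackets return a DataFrame
```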

29
Q

>>> import pandas as pd

>>> animals = pd.read_csv('animals.csv', sep='\t')

>>> animals.columns

Given the above fictional dataset, which command would you use to select the family, species, and population columns of the first five rows?

A

animals.loc[0:4, ['family', 'species', 'population']]

30
Q

What is an important difference between the:

  • .loc[]
  • .ix[]
  • .iloc[] methods
A

The important difference is that .loc[] and .ix[] are inclusive of the range of values selected, whereas .iloc[] is non-inclusive. In that sense, df.loc[0:1, :] would select the first two rows, but only the first row would be returned using df.iloc[0:1, :]. (Note that .ix[] has since been deprecated and removed from Pandas.)
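This inclusive / non-inclusive difference can be checked directly on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30]})

by_loc = df.loc[0:1, :]    # label-based and inclusive: rows 0 and 1
by_iloc = df.iloc[0:1, :]  # position-based and non-inclusive: row 0 only
```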

31
Q

What does the df.describe() method return?

A

Generate various summary statistics, excluding NaN values.

32
Q

features = ["The enchanted forest beamed with magic once the prince was born.", "Jinto's life changed forever when his planet surrendered without firing a single shot."]

Which technique would you use to encode the above features?

A

Encode using count vectorizer.
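A sketch of how those sentences could be encoded with scikit-learn’s CountVectorizer, which turns each sentence into a row of word counts:

```python
from sklearn.feature_extraction.text import CountVectorizer

features = [
    "The enchanted forest beamed with magic once the prince was born.",
    "Jinto's life changed forever when his planet surrendered without firing a single shot.",
]

cv = CountVectorizer()
# Sparse matrix: one row per sentence, one column per vocabulary word
X = cv.fit_transform(features)
```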

33
Q

Ordinal and Nominal features should be encoded using:

A

.astype('category'), using an ordered categorical dtype for ordinal features.
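A sketch of ordinal encoding on a made-up satisfaction feature; the ordered categorical dtype spelling below is the modern Pandas form:

```python
import pandas as pd

df = pd.DataFrame({'satisfaction': ['Low', 'High', 'Medium', 'Low']})

# Ordinal feature: the categories have a meaningful order
ordered = pd.CategoricalDtype(categories=['Low', 'Medium', 'High'], ordered=True)
codes = df.satisfaction.astype(ordered).cat.codes  # Low=0, Medium=1, High=2
```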

34
Q

How is missing data represented in Pandas?

A

Pandas represents missing data internally using Numpy’s np.nan

35
Q

What problems can missing data create?

A

If not accounted for, missing data might lead you to erroneous conclusions about your samples by resulting in incorrect sums and means, and even by skewing distributions.

36
Q

What are some ways of dealing with nan values in a dataset?

A

Any time a nan is encountered, replace it with a scalar value:

df.my_feature.fillna( df.my_feature.mean() )
df.fillna(0)

When a nan is encountered, replace it with the immediate, previous, non-nan value:

df.fillna(method='ffill')  # fill the values forward
df.fillna(method='bfill')  # fill the values in reverse
df.fillna(method='ffill', limit=5)  # cap the number of consecutive fills

Fill out nans by interpolating over them with the non-nan values that come immediately before and after. You can select the interpolation method you’d like to use, such as nearest, cubic, spline and more:

df.interpolate(method='polynomial', order=2)

37
Q

.convert_objects(convert_numeric=True) What does this method do?

A

This method converts a dataframe’s object columns into numeric values where possible. (It has since been deprecated in favor of pd.to_numeric.)

38
Q

What does the .unique() method do?

A

It returns an array of the distinct values in a dataframe column. (Use .nunique() if you want the count of unique values.)
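A quick check on a made-up column showing .unique() versus .nunique():

```python
import pandas as pd

df = pd.DataFrame({'wheat_type': ['kama', 'rosa', 'kama', 'canadian']})

values = df.wheat_type.unique()   # the distinct values themselves
count = df.wheat_type.nunique()   # how many distinct values there are
```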

39
Q

Plot a parallel coordinates chart, grouped by the wheat_type feature. Be sure to set the optional display parameter alpha to 0.4

A

from pandas.plotting import parallel_coordinates
import matplotlib.pyplot as plt

parallel_coordinates(df, 'wheat_type', alpha=0.4)
plt.show()

40
Q

Plot an Andrews curves chart, grouped by the wheat_type feature.

A

from pandas.plotting import andrews_curves
import matplotlib.pyplot as plt

andrews_curves(df, 'wheat_type')
plt.show()

41
Q

What is PCA?

A

Principal Component Analysis.

PCA falls into the group of dimensionality reduction algorithms.

42
Q

What are some of the weaknesses of PCA?

A
  1. The first is that it is sensitive to the scaling of your features.
  2. Although PCA is generally fast, very large datasets might still take a while to train.
  3. The last issue to keep in mind is that PCA is a linear transformation only!
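Because of the first weakness, a common pattern is to standardize your features before running PCA; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up samples whose two features live on very different scales
X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 130.0], [4.0, 120.0]])

# Standardize first so no feature dominates purely because of its units
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X_scaled)  # 2 features reduced to 1 component
```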
43
Q

Please complete the sentence so that it makes the most sense:

Principal component analysis…

A

Ensures each newly computed feature is orthogonal to all previously computed ones, minimizing overlaps.

Explanation

Unsupervised learning methods do not require or examine your labels / classifications. In fact, you should remove them from your dataset when you perform PCA on it.

PCA also is a linear transformation.

The last option was tricky; if you have more features than samples, PCA will only be able to create as many components as the number of samples you have. It is more ideal to have more samples than features, so that your hard limit on how many components you can reduce to is not based on how much data you’ve collected.

44
Q

Which of these statements is problematic?

A

Since PCA is sensitive to feature scaling, if you have a feature that is a linear transformation of the other, e.g. feature2 = 10 * feature1, then both features will be ignored

FALSE

Explanation

PCA is sensitive to feature scaling, but having two 100% correlated features won’t result in them both being ignored. Rather PCA will recognize that they both measure in the same ‘direction’, perhaps giving them more weight if anything.

45
Q

Good to Remember:

A

Whenever you’re given a dataset, the first thing you should do is find out as much about it as possible, both by reading up on any metadata, as well as by prodding through the actual data.

46
Q

Isomap is most beneficial…

A

When a non-linear, geometric structure is expressed in your data

Explanation

If your data doesn’t have an embedding, isomap might still be able to find some interesting information for you, but it works best when there actually is a non-linear pattern embedded in your higher dimensionality data.

If you don’t know how many samples are needed to capture that pattern and fail to collect enough, isomap won’t be as effective as if you had collected more samples than were needed.

If your data actually has a linear relationship, then using PCA will get you where you want to be more efficiently than isomap.

47
Q

Which of the following explanations of isomap is true?

A

A one sentence summary of isomap’s implementation is that at its core, it is essentially a node distance map that has been fed into a special type of PCA.

Explanation

One of isomap’s greatest weaknesses is that noisy data might short-circuit the actual geodesic path. In such cases, isomap will prefer the noisy path to the actual path and produce an incorrectly warped mapping.

Isomap is slower than PCA because isomap essentially implements a multi-dimensional scaling (similar to PCA) through projection; but in addition to that, it also has to calculate the nearest neighbor map.

As mentioned in the lecture, even if the distance metric isn’t 100% accurate, isomap can still function reasonably. Particularly for distant nodes, they won’t be included in the *nearest* neighbors list to start with.
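A minimal sketch with scikit-learn’s Isomap on made-up data: a 1-D curve embedded non-linearly in 2-D, unrolled back down to a single dimension:

```python
import numpy as np
from sklearn.manifold import Isomap

# Made-up data: a sine curve (a non-linear 1-D structure) embedded in 2-D,
# with a little noise added
rng = np.random.RandomState(0)
t = np.linspace(0, 3 * np.pi, 100)
X = np.column_stack([np.sin(t), t]) + rng.normal(scale=0.01, size=(100, 2))

# Build the nearest-neighbor map, then reduce to one dimension
iso = Isomap(n_neighbors=5, n_components=1)
X_1d = iso.fit_transform(X)
```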

48
Q

In data wrangling, how do you treat irrelevant, incomplete, and missing data?

A

In data wrangling, irrelevant, incomplete, and missing data is either defaulted to a specific value or removed entirely.

NaNs are stripped out, typographical errors are patched, and perhaps even some data normalization occurs

49
Q

What is the difference between K Means and K Nearest Neighbor classifiers?

A

K-Means is an unsupervised clustering algorithm, whereas K-Neighbors is a supervised classification algorithm. If clustering is the process of separating your samples into groups, then classification would be the process of assigning samples into those groups. Given a set of groups, take a set of samples and mark each sample as being a member of a group. Each group is the correct answer, label, or classification of the sample.
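A sketch of the two working together on made-up points: K-Means invents the groups without any labels, then K-Neighbors learns to assign new samples to those groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Unsupervised: K-Means separates the samples into groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: K-Neighbors classifies new samples into those groups
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
new_label = knn.predict([[4.8, 5.2]])
```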

50
Q

What is a model?

A

A model is any formula or algorithm designed to represent the mechanics of your data.

51
Q

What are estimators in scikit-learn?

A

SciKit-Learn’s documentation refers to all of the machine learning methods in its library as estimators.

52
Q

What are predictors in scikit?

A

Lastly, when dealing with data that come with attributes you want to learn to predict, such as supervised learning problems, the estimators designed to handle this are called predictors.

53
Q

Confusion Matrix plotting….good to remember

A

Traditionally, the predicted targets are aligned on the X-axis of the matrix, and the true values are aligned on the Y-axis
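With scikit-learn’s confusion_matrix, true values index the rows and predicted values index the columns, matching that plotting convention (the labels here are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]  # the actual labels
y_pred = [0, 1, 1, 1, 0]  # what the model predicted

# Row i, column j counts samples with true label i predicted as label j
cm = confusion_matrix(y_true, y_pred)
```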

54
Q

What are SVMs?

A

Support vector machines are a set of supervised learning algorithms that you can use for classification, regression and outlier detection purposes.

55
Q

Good to Remember about the working of an SVM?

A

A support vector classifier behaves just like your stick placement. The two things it wishes to fulfill, in order of priority, are: first, finding a way to separate your data. It does this by looking at the balls from either class that are closest to one another, or in machine learning lingo, the support vectors. Second, the algorithm ensures it separates your data in the best way possible by orienting the boundary such that the gap, or margin, between it and your support vector samples is maximized.
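A minimal sketch with scikit-learn’s linear SVC on made-up points: after fitting, the support vectors (the samples closest to the boundary) are exposed directly:

```python
from sklearn.svm import SVC

# Two made-up, linearly separable classes of "balls"
X = [[0, 0], [0, 1], [3, 3], [3, 4]]
y = ['red', 'red', 'blue', 'blue']

svc = SVC(kernel='linear')
svc.fit(X, y)

# The samples closest to the maximum-margin boundary
support_points = svc.support_vectors_
```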
