Data Science Essentials Flashcards

1
Q

What is a Random Variable?

A

A random variable assigns a numerical value to each possible outcome of a random experiment.
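
As a small illustration (a sketch, not part of the original card), the random variable below maps each outcome of flipping two coins to the number of heads. The names `outcomes` and `number_of_heads` are invented for this example, and the coin is assumed to be fair.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of the experiment: every sequence of two coin flips.
outcomes = list(product("HT", repeat=2))


def number_of_heads(outcome):
    """Random variable: assigns a number to each outcome."""
    return outcome.count("H")


# Assuming a fair coin, every outcome is equally likely, so the probability
# of a value is the share of outcomes that map to it.
counts = Counter(number_of_heads(o) for o in outcomes)
distribution = {value: Fraction(n, len(outcomes)) for value, n in counts.items()}

for value in sorted(distribution):
    print(value, distribution[value])
```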

2
Q

What are the 5 Vs of Big Data?

A
  1. Velocity
  2. Veracity
  3. Variety
  4. Volume
  5. Value
3
Q

What does machine learning include?

A

Machine Learning is a computing technique that has its origins in artificial intelligence (AI) and statistics. Machine Learning solutions include:

  • Classification - Predicting a category (in the simplest case, a Boolean true/false value) for an entity with a given set of features.
  • Regression - Predicting a real numeric value for an entity with a given set of features.
  • Clustering - Grouping entities with similar features.
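
The difference between the three kinds of solution shows up in what each one returns. The toy sketch below uses deliberately simple stand-ins, all chosen for this example rather than taken from the card: a nearest-example rule for binary classification, an ordinary least-squares line for regression, and a gap-based grouping for clustering. The student data is invented.

```python
# Toy data (invented): (hours_studied, exam_score, passed) for five students.
students = [(1.0, 52.0, False), (2.0, 58.0, False), (4.0, 70.0, True),
            (5.0, 76.0, True), (6.0, 82.0, True)]


def classify(hours):
    """Classification: predict True/False from the nearest known example."""
    nearest = min(students, key=lambda s: abs(s[0] - hours))
    return nearest[2]


def fit_line(xs, ys):
    """Regression: least-squares line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def cluster(values, gap):
    """Clustering: group sorted values, starting a new group at each large gap."""
    groups = [[values[0]]]
    for v in values[1:]:
        if v - groups[-1][-1] > gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return groups


hours = [s[0] for s in students]
scores = [s[1] for s in students]
slope, intercept = fit_line(hours, scores)

print(classify(4.5))                # a True/False label
print(slope * 3.0 + intercept)      # a real number
print(cluster(sorted(hours), 1.5))  # groups of similar entities
```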
4
Q

What does the five-number summary contain?

A
  1. Min
  2. Q1 (first quartile)
  3. Q2 (median)
  4. Q3 (third quartile)
  5. Max
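
A minimal sketch of computing these five values with Python's standard `statistics` module. The sample data is invented for the example, and quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values.

```python
import statistics

# Illustrative data, assumed for this example.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
# method="inclusive" treats the data as the whole population; the default
# "exclusive" method can give slightly different quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = {
    "min": min(data),
    "Q1": q1,
    "Q2 (median)": q2,
    "Q3": q3,
    "max": max(data),
}
print(five_number_summary)
```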
5
Q

Python: merging data frames (link to good examples)

A

http://chrisalbon.com/python/pandas_join_merge_dataframe.html

6
Q

What is one of the first steps of machine learning?

A

In general, the first step in machine learning is to figure out how to represent your data as a vector of numbers.
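
As a rough sketch of what "representing data as a vector" can mean, the function below turns one record into a flat list of numbers. The field names, the category list, and the choice of one-hot encoding are assumptions made for this example.

```python
# Hypothetical category list, invented for this example.
COLORS = ["red", "green", "blue"]


def to_vector(record):
    """Turn a dict describing one entity into a flat list of numbers."""
    # Numeric features are used as they are.
    vector = [float(record["height_cm"]), float(record["weight_kg"])]
    # A categorical feature is one-hot encoded: one slot per category,
    # with 1.0 in the slot for the matching category and 0.0 elsewhere.
    vector += [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return vector


print(to_vector({"height_cm": 170, "weight_kg": 65, "color": "green"}))
```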

7
Q

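What is the CRISP-DM process?

A

[Image not recovered: CRISP-DM process diagram.] CRISP-DM (Cross-Industry Standard Process for Data Mining) has six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.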
8
Q

What do summary statistics generally contain?

A

Summary statistics generally include the mean, the median and quartiles of the data. This gives you a first quick look at the distribution of data values.
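
A short sketch of that "first quick look", using invented values with one large outlier. Comparing the mean with the median is one common way to spot skew, and the gap between them here comes entirely from the single outlier.

```python
import statistics

# Illustrative values (assumed); the last one is an outlier.
values = [2, 3, 3, 4, 5, 5, 6, 40]

mean = statistics.mean(values)
median = statistics.median(values)
q1, q2, q3 = statistics.quantiles(values, n=4)

# A mean well above the median hints that the distribution is right-skewed.
print(f"mean={mean} median={median} quartiles={q1}, {q2}, {q3}")
```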

9
Q

What is the benefit of a scatter plot matrix?

A

A scatter plot matrix plots every pair of variables against each other, so you can examine the relationships between many variables in a dataset in a single view.

10
Q

The data science process includes the following activities:

A
  • Data selection.
  • Preprocessing.
  • Transformation.
  • Data Mining.
  • Interpretation and evaluation.
11
Q

What is a discrete random variable?

A

A discrete random variable has a number of outcomes that you could count.

12
Q

What are some aspects of Data Analytic Thinking?

A
  • Replace intuition with data-driven analytical decisions.
  • Transform raw data into a valuable asset.
  • Increase the pace of action.
13
Q

What is Data Science?

A

Data Science is the exploration and quantitative analysis of all available structured and unstructured data to develop understanding, extract knowledge, and formulate actionable results.

14
Q

What is a continuous variable?

A

A continuous variable is a variable that can take any value within a range, so it has an infinite (uncountable) number of possible values. By contrast, a discrete variable can take only distinct values that you could count.

15
Q

Types of Machine Learning algorithms?

A
  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. KNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting and AdaBoost
16
Q

Machine Learning : Good to Remember

A

Machine learning is a powerful set of techniques for prediction.

Machine learning allows you to make predictions and detect patterns that would otherwise have gone unnoticed.

Machine learning started as a subfield of artificial intelligence, and its goal is to allow computers to learn by example.

17
Q

Good to remember about discrete and continuous variables?

A

A discrete variable is a variable whose value is obtained by counting.

Examples:

  • number of students present
  • number of red marbles in a jar
  • number of heads when flipping three coins
  • students’ grade level

A continuous variable is a variable whose value is obtained by measuring.

Examples:

  • height of students in class
  • weight of students in class
  • time it takes to get to school
  • distance traveled between classes

18
Q

What are some common data cleaning tasks in Azure?

A
  • Ingest and join data from multiple sources.
  • Delete unnecessary and redundant columns.
  • Consolidate the number of categories of a categorical feature.
  • Treat missing values.
  • Remove duplicate rows.
  • Generate a calculated column.
  • Locate and treat outliers.
  • Scale numeric values.
19
Q

What are some important aspects of Data Cleansing?

A

One of the most important aspects of any data science project is to clean, filter, and otherwise transform data to prepare it for use in a model. Common tasks when preparing data include:

  • Identifying and handling missing or duplicate values.
  • Identifying and handling outliers and errors.
  • Scaling numeric values to make them easier to compare.
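
The tasks above can be sketched in plain Python as follows. The rows, the plausible value range used to flag errors, and the choice of min-max scaling are all assumptions made for this example. Dropping rows is only one of several ways to treat missing values and outliers.

```python
# Hypothetical raw rows of (id, reading): None marks a missing value, and
# 500.0 is assumed to be a data-entry error for this example.
raw_rows = [
    ("a", 12.0), ("b", 15.0), ("c", None), ("b", 15.0),
    ("d", 18.0), ("e", 500.0), ("f", 21.0),
]

# 1. Handle duplicates (here: keep the first occurrence, preserving order).
unique_rows = list(dict.fromkeys(raw_rows))

# 2. Handle missing values (here: drop them; imputing is another option).
complete_rows = [(k, v) for k, v in unique_rows if v is not None]

# 3. Handle outliers and errors (here: an assumed plausible range of 0 to 100).
cleaned_rows = [(k, v) for k, v in complete_rows if 0.0 <= v <= 100.0]

# 4. Scale to the range 0 to 1 (min-max scaling) so values are comparable.
values = [v for _, v in cleaned_rows]
low, high = min(values), max(values)
scaled = {k: (v - low) / (high - low) for k, v in cleaned_rows}

print(cleaned_rows)
print(scaled)
```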
20
Q

What is a conditioned histogram?

A

A conditioned histogram is a histogram of a subset of data conditioned on another variable in the dataset. Often the histogram of a numeric variable is conditioned on a categorical variable. It is also possible to condition a histogram on (generally overlapping) ranges of a numeric variable.
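
A minimal text-only sketch of the idea: one histogram of a numeric variable per value of a categorical variable. The rows, the bin width, and the function name `conditioned_histograms` are invented for this example. A plotting library would normally draw these as side-by-side panels.

```python
from collections import Counter, defaultdict

# Hypothetical (category, value) rows, invented for this example.
rows = [("sedan", 21), ("sedan", 24), ("sedan", 29), ("suv", 14),
        ("suv", 17), ("suv", 18), ("sedan", 31), ("suv", 22)]

BIN_WIDTH = 5  # assumed bin width


def conditioned_histograms(rows, bin_width):
    """Build one histogram of the numeric value per category."""
    histograms = defaultdict(Counter)
    for category, value in rows:
        bin_start = (value // bin_width) * bin_width  # lower edge of the bin
        histograms[category][bin_start] += 1
    return histograms


for category, histogram in sorted(conditioned_histograms(rows, BIN_WIDTH).items()):
    print(category)
    for bin_start in sorted(histogram):
        bar = "#" * histogram[bin_start]
        print(f"  {bin_start:>3}-{bin_start + BIN_WIDTH - 1:<3} {bar}")
```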

21
Q

How do you judge the quality of your prediction model?

A

Prediction quality should always be judged out of sample. Base your judgement on the results for the test data set, which the model did not see during training, and not on the results for the training data set.
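
A hedged sketch of an out-of-sample check: hold back part of the data, fit on the rest, and report the error on the held-back part. The synthetic data, the 80/20 split, and the one-parameter model are assumptions chosen to keep the example short.

```python
import random

# Synthetic data, assumed for this example: y is roughly 2 * x plus noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in range(50)]

# Hold out 20% of the rows as a test set the model never sees in training.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_slope(rows):
    """Least-squares slope for a line through the origin, y = slope * x."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def mean_squared_error(slope, rows):
    """Average squared difference between actual and predicted y."""
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)


slope = fit_slope(train)  # fit on the training set only
print("train MSE:", mean_squared_error(slope, train))
print("test MSE: ", mean_squared_error(slope, test))  # judge the model on this
```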