Data Science for Dummies Flashcards

1
Q

What are the 3 types of data and examples of each?

A

Structured - data stored, processed, and manipulated in a traditional relational database management system (RDBMS), e.g., a MySQL database that uses tabular data

Unstructured - data generated from human activities that doesn’t fit into a structured database format, e.g., emails, Word docs, AV files

Semistructured - data that doesn’t fit into a structured database system but is organizable by tags that are useful for creating a form of order and hierarchy. Examples include XML files (which store data in the form of hierarchical elements) and JSON files (which store simple data structures and objects in JavaScript Object Notation (JSON), a standard data-interchange format primarily used for transmitting data between a web application and a server; JSON files are lightweight, text-based, human-readable, and editable with a text editor).
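
For illustration, a minimal Python sketch of working with semistructured JSON data (uses only the standard-library json module; the email structure and field names are invented):

  import json

  # Tags/keys impose hierarchy, but there is no fixed relational schema
  raw = '{"email": {"from": "ana@example.com", "tags": ["work", "urgent"]}}'
  record = json.loads(raw)            # parse JSON text into Python objects
  print(record["email"]["tags"][0])   # navigate the hierarchy -> "work"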

2
Q

What is big data?

A

Data that exceeds the processing capacity of conventional database systems because it’s too big or lacks the structural requirements of a traditional database architecture

3
Q

What does it mean to query data?

A

Write commands to extract relevant datasets from data storage systems (usually done with SQL, Structured Query Language)
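
A minimal sketch of querying, using Python’s built-in SQLite driver (the sales table and its columns are invented for illustration):

  import sqlite3

  conn = sqlite3.connect(":memory:")   # throwaway in-memory database
  conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
  conn.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("east", 100.0), ("west", 250.0), ("east", 75.0)])

  # The query: pull only the relevant subset out of the storage system
  for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
      print(row)   # ('east', 175.0) then ('west', 250.0)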

4
Q

What is Hadoop?

A

A platform for batch-processing and storing large volumes of data. It was designed to boil down big data into smaller datasets that are more manageable for data scientists to analyze. Its popularity has been declining since 2015.

5
Q

What are the 3 characteristics that define big data?

A

Volume, velocity, and variety. Because these 3 Vs are ever expanding, newer, more innovative data technologies must be developed to manage big data problems

6
Q

What’s the size of big data volume?

A

As low as 1 terabyte, with no upper limit. If your org owns at least 1 terabyte of data, that data technically qualifies as big data.

7
Q

What does it mean that most big data is “low value?”

A

Big data is composed of a huge number of very small transactions in many formats that have value only once they’re aggregated (by data engineers) and analyzed (by data scientists)

8
Q

Why is data velocity important?

A

A lot of big data is created using automated processes, and a lot of it is low value. You need systems that are able to ingest a lot of it quickly and generate timely and valuable insights

9
Q

What is data velocity?

A

Data velocity is data volume per unit of time. Big data enters an average system at velocities ranging from 30 kilobytes per second to 30 gigabytes per second.

10
Q

What does latency refer to?

A

Related to data velocity. Latency is a system’s delay in moving data after it’s been instructed to do so (every data system has some). Many engineered data systems are required to have latency under 100 milliseconds from data creation to the time the system responds.

11
Q

What is throughput?

A

Related to data velocity. A characteristic describing a system’s capacity for work per unit of time. The capabilities of data handling and processing technologies limit data velocities

12
Q

What are some tools (3) that ingest data into a system (data ingestion)?

A

Apache Sqoop - quickly transfers data back and forth between a relational database system and the Hadoop distributed file system (HDFS)

Apache Kafka
Apache Flume

13
Q

What does data variety refer to?

A

High variety data comes from a multitude of sources with different underlying structures (structured, unstructured, or semistructured)

14
Q

What is a data lake?

A

A nonhierarchical data storage system used to hold huge volumes of multistructured raw data within a flat storage architecture; in other words, a collection of records that come in a uniform format and are not cross-referenced in any way. HDFS can be used as a data lake storage repository, as can the AWS S3 platform (one of the more popular cloud architectures for storing big data).

15
Q

What is a data warehouse?

A

A data warehouse is a centralized data repository that you can use to store and access only structured data. A more traditional warehouse system is a data mart, a storage system for structured data that you can use to store one particular focus area of data belonging to one line of business in the company

16
Q

What is machine learning?

A

The practice of applying algorithms to learn from and make automated predictions from data

17
Q

What is a machine learning engineer?

A

A hybrid between a software engineer and a data scientist (NOT a data engineer): a software engineer skilled enough in data science to deploy advanced data science models within the applications they build, bringing ML models into production in a live environment such as a SaaS product or a webpage

18
Q

What is a data engineer?

A

Data engineers build and maintain data systems for overcoming data processing bottlenecks and data handling problems that arise from the high volume, velocity, and variety of big data. They use software engineering to design systems for, and solve problems with, handling and manipulating big datasets. They often have experience working with and designing real-time processing frameworks and massively parallel processing platforms, as well as RDBMSs.

19
Q

What programming languages do data engineers code in?

A

Java, C++, Scala, or Python. They also know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big datasets into more manageable sizes.

20
Q

What is the purpose of data engineering?

A

Engineer large scale data solutions by building coherent, modular, scalable data processing platforms that data scientists can use to derive insights

21
Q

True or false: Data engineering involves engineering a built system

A

False. It involves the designing, building, and implementing of software solutions to problems in the data world.

22
Q

What’s the difference between a data engineer, ML engineer, and data scientist?

A

The data engineer will store, migrate, and process your data, data scientist will make sense of the data, and ML engineer will bring ML models into production

23
Q

What are the big cloud data storage services?

A

AWS, Google Cloud, Microsoft Azure

24
Q

Why is cloud data storage more beneficial than on-premise data storage?

A

-Cloud service providers take care of the work to configure and maintain computing resources, which makes the data easier to use
-More flexibility: you can turn off cloud services you no longer need, vs. having idle on-premise servers
-More secure

25
Q

What is serverless computing?

A

Computing done in the cloud rather than on a desktop or on-premise servers. A physical server still exists, but it’s supported by the cloud computing company you retain.

26
Q

What is FaaS and some examples?

A

Function as a service. It’s a containerized cloud computing service that makes it easier to execute code in a cloud environment without needing to set up code infrastructure (the data science model runs directly in the container). Examples: AWS Lambda, Google Cloud Functions, Azure Functions

27
Q

What is Kubernetes?

A

An open source software suite that manages and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. It helps software developers build and scale apps quickly, but it needs data engineering expertise to set up quickly.

28
Q

What does it mean for a system to be fault tolerant?

A

Built to continue operating successfully even if one of its subcomponents fails. Has redundancy in computing nodes.

29
Q

What does it mean for a system to be extensible?

A

It can be extended or shrunk in size without disrupting its operations

30
Q

What is parallel processing?

A

Data is processed quickly because the work required to process it is distributed across multiple nodes in a system. This configuration allows for simultaneous processing of tasks across multiple nodes.

31
Q

Name 3 cloud warehouse solutions.

A

Amazon Redshift - big data warehousing service running on data sitting in the cloud

Snowflake - SaaS solution providing parallel processing for structured and semistructured data stored in the cloud on Snowflake’s servers

Google BigQuery

32
Q

True or false: an RDBMS can handle big data

A

False; an RDBMS works only with structured, tabular data that can be queried with SQL

33
Q

What is NoSQL used for?

A

NoSQL is better for big data because it can handle structured, unstructured, and semistructured data.

34
Q

What is a key value pair?

A

A pair of data items: the key is the record identifier, and the value is the data identified by that key
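
A key-value store in miniature, sketched with a plain Python dict (the key format and fields are invented):

  store = {}
  store["user:1001"] = {"name": "Ana", "plan": "pro"}   # key -> value
  print(store["user:1001"]["plan"])                     # look up by key -> "pro"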

35
Q

What are the 4 categories of nonrelational databases NoSQL offers?

A

Graph databases
Document databases (aka document store)
Key-value stores
Column family stores

36
Q

What is a real time processing framework?

A

A framework that processes data in real time as the data streams and flows into the system. The frameworks process data in micro batches and return very quick results.

37
Q

What is in-memory?

A

In-memory refers to processing data within the computer’s memory without reading and writing its computational results to a disk; this is much faster but cannot process much data per processing interval

38
Q

What is machine learning?

A

The practice of applying algorithmic models to data repeatedly so that your computer discovers hidden patterns or trends you can use to make predictions (aka algorithmic learning)

39
Q

What are the 3 steps in machine learning?

A

Setup - acquiring data, preprocessing it, selecting the variables for the task at hand (feature selection), and breaking the data into training/test datasets
Learning - involves model experimentation, training, building, and testing; use the training data to train the model and the test data to test the accuracy of its predictions
Application - model deployment and prediction (see the sketch below)
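
A compact sketch of the three steps with scikit-learn’s bundled iris data (the choice of a k-nearest-neighbors classifier here is arbitrary, just to make the flow concrete):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  # 1. Setup: acquire data and break it into training/test sets
  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

  # 2. Learning: train the model, then test prediction accuracy on held-out data
  model = KNeighborsClassifier().fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))

  # 3. Application: use the deployed model to predict on new observations
  print("prediction:", model.predict(X_test[:1]))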

40
Q

How should you break data into test and training sets?

A

Apply random sampling to 2/3 of the original dataset in order to train the model, and use the remaining 1/3 for evaluating the model’s predictions.
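
A minimal sketch of that split using scikit-learn (the toy arrays are invented):

  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.arange(30).reshape(15, 2)   # toy feature matrix (15 instances)
  y = np.arange(15)                  # toy target

  # Randomly sample 2/3 for training; hold out the remaining 1/3 for evaluation
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=1/3, random_state=42)
  print(len(X_train), len(X_test))   # 10 5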

41
Q

What is a random sample?

A

A random sample contains observations that all had an equal probability of being selected from the original dataset

42
Q

In ML, what is an instance?

A

The same as a row in a data table, an observation in stats, and a data point. AKA a case.

43
Q

In ML, what is a feature?

A

The same as a column or field in a data table, a variable in stats. In regression models, a feature is also the independent variable

44
Q

In ML, what is a target variable?

A

The same as a dependent variable in stats

45
Q

In ML, what is feature selection and how is it different from feature engineering?

A

Feature selection is the process of selecting appropriate variables, whereas feature engineering is where you design input variables from an underlying data set (used when your model needs a better representation of the problem being solved than is available in the raw data set)

46
Q

What are supervised learning algos? What are their use cases?

A

Supervised learning algos require that input data has labeled features. These algos learn from known features of the data to produce an output model that successfully predicts labels for new, incoming, unlabeled data points. Use them when you have historical values that are good predictors of future events, e.g., survival analysis (aka event history analysis, designed to predict the timing of an event, such as a mother’s age at first childbirth) or fraud detection.

Logistic regression is an example of an algorithm used in supervised learning.
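
A minimal supervised-learning sketch with scikit-learn (the data is synthetic): labeled features go in, and a model that predicts labels for new points comes out.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # labeled data
  clf = LogisticRegression().fit(X, y)   # learn from the known labels
  print(clf.predict(X[:3]))              # predict labels for incoming points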

47
Q

What are unsupervised algos?

A

Unsupervised learning algos accept unlabeled data and attempt to group observations into categories based on underlying similarities in input features. Examples of this include principal component analysis, k-means clustering, and singular value decomposition. Use cases include recommendation engines, facial recognition, and customer segmentation
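
A minimal unsupervised sketch (synthetic data): no labels are supplied, and k-means groups the observations by similarity alone.

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs

  X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # labels discarded
  labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
  print(labels[:10])   # cluster assignments discovered from the features alone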

48
Q

What is reinforcement learning in ML?

A

A behavior-based learning model: the model is given “rewards” based on how it behaves, and it learns to behave in a way that maximizes rewards.

49
Q

What are descriptive statistics?

A

Provide a description that illuminates some characteristic of a numerical dataset (e.g., distribution, central tendency, dispersion)

50
Q

What’s the difference between standard deviation and variance?

A

Standard deviation is the spread of a group of numbers from the mean, expressed in the same units as the data. Variance measures the average degree to which each point differs from the mean: it is the average of the squared differences between each data point and the mean, and standard deviation is the square root of the variance.
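
A short worked example in numpy to make the two concrete:

  import numpy as np

  data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
  mean = data.mean()                         # 5.0
  variance = ((data - mean) ** 2).mean()     # average squared deviation -> 4.0
  std_dev = np.sqrt(variance)                # square root of variance -> 2.0
  print(variance, std_dev)                   # matches np.var(data), np.std(data)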

51
Q

True or false: Descriptive statistics posit that X causes Y.

A

False; they only highlight relationships between X and Y

52
Q

What are inferential statistics?

A

They carve out a small section of the dataset (a sample) and attempt to deduce significant information about the larger dataset. Inferential stats, such as regression analysis, DO try to predict by studying causation.

53
Q

What are the 3 types of distributions?

A

Normal distribution (numeric continuous) - bell curve

Binomial distribution (numeric discrete) - models the number of successes that can occur in a certain number of attempts when only two outcomes are possible. Binary variables (only two outcomes possible) have a binomial distribution.

Categorical distributions (non-numeric) - for non-numerical categorical variables or ordinal variables (ordered categorical variables, e.g., airline classes)

54
Q

What is Naive Bayes and what is it used for? What are the 3 types?

A

It’s an ML method used to predict the likelihood that an event will occur given evidence defined in your data features (conditional probability). It is used for classification and is especially useful for classifying text data, e.g., a model that predicts whether an email is spam (the event) based on features gathered from content in a repository (the evidence).

3 types:
-MultinomialNB, BernoulliNB, GaussianNB
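
A tiny MultinomialNB text-classification sketch with scikit-learn (the example messages and labels are invented):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  texts = ["win money now", "cheap money win", "meeting at noon", "project meeting notes"]
  labels = [1, 1, 0, 0]                  # 1 = spam (the event), 0 = not spam

  vec = CountVectorizer()
  X = vec.fit_transform(texts)           # word counts are the evidence/features
  model = MultinomialNB().fit(X, labels)
  print(model.predict(vec.transform(["win cheap money"])))   # -> [1], i.e. spam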

55
Q

What’s the difference between a binomial and multinomial distribution?

A

A multinomial distribution can produce 2+ outcomes; a binomial distribution can produce only 2

56
Q

TRUE OR FALSE: Many ML methods assume features are independent

A

True; you need to test whether they’re independent by evaluating their correlation. A correlation with an r value close to 0 could indicate the variables are independent.

57
Q

What is an a priori assumption?

A

Predictions are based on an assumption that past conditions still hold true

58
Q

What is the Pearson correlation and what assumptions is it based on?

A

A method for measuring the linear relationship between two continuous variables (and determining whether a relationship exists)

In order to use the Pearson correlation, you must have:
1. Normally distributed data
2. Continuous, numeric variables
3. Linearly related variables
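
A minimal Pearson r sketch with SciPy (the toy values are invented):

  import numpy as np
  from scipy.stats import pearsonr

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly linear in x
  r, p_value = pearsonr(x, y)
  print(r)   # close to 1 -> strong positive linear relationship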

59
Q

What is Spearman’s rank correlation?

A

A popular test for determining correlation between ordinal variables.

It assumes:
1. Your variables are ordinal
2. Your variables are related nonlinearly (you can tell by looking at a graph)
3. Your data are nonnormally distributed

Imagine you have two sets of data, like the heights of students and their test scores. Spearman’s rank correlation helps us figure out if there’s a relationship between these two sets of data, but it doesn’t assume that the relationship is a straight line.

Here’s how it works:

  1. Ranking:
    • First, we rank the values in each set. Ranking is like giving each value a position from smallest to largest. If two students have the same height or test score, they get the same rank.
  2. Differences between Ranks:
    • Then, we look at the differences in ranks for each pair of values. For example, if one student is ranked 3rd in height and 5th in test score, the difference is 2 (5 - 3).
  3. Squaring the Differences:
    • We square these differences. Squaring just means multiplying a number by itself. So, if the difference was 2, we square it to get 4 (2 * 2).
  4. Adding Up the Squares:
    • We add up all these squared differences.
  5. Calculating the Correlation:
    • Finally, we use a formula to calculate a number called Spearman’s rank correlation coefficient (let’s call it ρ). With no ties, ρ = 1 - 6Σd² / (n(n² - 1)), where d is each rank difference and n is the number of pairs. This number tells us how related the two sets of data are.
    • If ρ is close to 1, it means there’s a strong relationship: as one set goes up, the other tends to go up too.
    • If ρ is close to -1, it means there’s a strong relationship, but as one set goes up, the other tends to go down.
    • If ρ is close to 0, it means there’s not much of a relationship.

So, Spearman’s rank correlation helps us understand if there’s a connection between two sets of data, without assuming that the connection is a perfect straight line. It’s useful when the data doesn’t follow a normal pattern or when we’re working with ranks or orderings instead of actual numbers.
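
A minimal Spearman’s rho sketch with SciPy (the heights and scores are invented):

  from scipy.stats import spearmanr

  heights = [150, 155, 160, 165, 170, 175]
  scores  = [62, 70, 68, 80, 85, 95]   # mostly increasing, but not a straight line
  rho, p_value = spearmanr(heights, scores)
  print(rho)   # near 1 -> strong monotonic relationship between the rankings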

60
Q

What does it mean to reduce a dataset’s dimensionality?

A

Reduce a dataset’s feature count without losing the important information the dataset contains, by compressing its features’ information into synthetic variables you can subsequently use to make predictions or as input into another ML model

61
Q

What is SVD?

A

Singular value decomposition.
Allows you to reduce the dimensionality of your dataset (the number of features you track when carrying out an analysis). It lets you compress the dataset and remove redundant information and noise.

SVD is applied to analyze principal components from large, noisy, sparse datasets, an ML approach called principal component analysis (PCA).
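
A minimal numpy SVD sketch: factor a matrix, then keep only the top singular values to compress away noise and redundancy (the matrix values are arbitrary random numbers):

  import numpy as np

  A = np.random.default_rng(0).normal(size=(6, 4))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)

  k = 2                                        # keep the 2 strongest components
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # compressed, rank-k approximation
  print(np.linalg.norm(A - A_k))               # how much information was dropped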

62
Q

What’s the difference between PCA and SVD?

A

PCA = principal component analysis
SVD = singular value decomposition

PCA assumes you’re working with a square input matrix; if the input matrix is not square, use SVD

63
Q

What does CVE stand for?

A

Cumulative variance explained. The lower the CVE, the more you should take your model’s results with a grain of salt. (Don’t go into the weeds on this; just know roughly what it relates to.)

64
Q

What are the different ways you can reduce dimensionality of your dataset?

A

SVD = singular value decomposition
Factor analysis
PCA = Principal component analysis

65
Q

What are latent variables?

A

Meaningful inferred variables that underlie a dataset but are not directly observable

66
Q

What is factor analysis?

A

The process of fitting a model to prepare a dataset for analysis by reducing its dimensionality and information redundancy. It compresses a dataset’s information into a reduced set of non-information-redundant latent variables.

67
Q

What is PCA?

A

Principal component analysis is related to SVD. It’s an unsupervised method that finds relationships between the features in your dataset, then transforms and reduces them to a set of non-information-redundant principal components - uncorrelated features that explain the information contained within the dataset. The result is a refined representation of the dataset without redundancy, noise, or outliers.
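
A minimal scikit-learn PCA sketch, compressing iris’s 4 correlated features into 2 principal components:

  from sklearn.datasets import load_iris
  from sklearn.decomposition import PCA

  X, _ = load_iris(return_X_y=True)
  pca = PCA(n_components=2)
  X_reduced = pca.fit_transform(X)             # 150 x 4 -> 150 x 2
  print(pca.explained_variance_ratio_.sum())   # share of variance retained (relates to CVE)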

68
Q

What does MCDM stand for?

A

Multiple criteria decision making.

69
Q

What is linear regression?

A

An ML method you can use to describe and quantify the relationship between your target variable y (aka the predictant) and the dataset features you’ve chosen to use as predictor variables (aka dataset X in ML).
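
In standard regression notation (not specific to this book): with one predictor feature x, the fitted model is y = β0 + β1x + ε, where β0 is the intercept, β1 the slope, and ε the prediction error (residual); with p features this extends to y = β0 + β1x1 + … + βpxp + ε.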

70
Q

What are the limitations of linear regression?

A
  1. Works only with numerical variables (not categorical)
  2. If the dataset is missing values, you will have problems
  3. Outliers distort results
  4. Assumes a linear relationship exists
  5. Assumes features are independent
  6. Prediction errors (residuals) should be normally distributed

Needs at least 20 observations per predictive feature for reliable results

71
Q

What is logistic regression?

A

An ML method you can use to estimate values for a categorical target variable based on your selected features. The target variable should be numeric and encode the class/category. In addition to predicting the class of each observation of your target variable, it indicates the probability for each of its estimates.

72
Q

What are the requirements for logistic regression?

A
  1. Does not need to be a linear relationship between features and target variable
  2. Residuals do not have to be normally distributed
  3. Predictive features aren’t required to have a normal distribution

73
Q

What are 4 limitations of logistic regression?

A
  1. Missing values should be treated or removed
  2. Your target variable must be binary or ordinal (1 for yes and 0 for no)
  3. Predictive features should be independent of each other
  4. Needs at least 50 observations per predictive feature to generate reliable results

74
Q

What is a least squares regression line?

A

A method that fits a linear regression line to a dataset. You square the vertical distance values that describe the distances between the data points and the best fit line, add up those squared distances, and adjust the placement of the line so the summed squared distance value is minimized.
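
In symbols: the fitted line ŷ = β0 + β1x is chosen so that the sum of squared vertical distances (the residuals), Σ(yᵢ - ŷᵢ)², is as small as possible.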

75
Q

What is time series?

A

A collection of data on attribute values over time, used to predict future instances of the measure based on past observational data.

76
Q

What are the 5 different patterns in time series?

A

Constant time series
Trended time series
Untrended seasonal time series
Trended seasonal time series
Nonstationary processes - unpredictable behavior not related to seasonality, resulting instead from economic or industry-wide conditions. These can’t be forecast.

77
Q

What’s the difference between multivariate and univariate analysis?

A

Multivariate analysis is the analysis of relationships between multiple variables; univariate analysis is the quantitative analysis of only one variable at a time.

78
Q

What is ARMA?

A

Autoregressive moving average is a class of forecasting methods you can use to predict future values from current and historical data. It combines autoregression techniques (analyses that assume previous observations are good predictors of future values) with moving average techniques, which act like a smoothing tool: they help you see the forest (long-term trends) without being distracted by the trees (short-term fluctuations), making time-varying data easier to interpret and analyze.
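
A hedged ARMA sketch with statsmodels (an ARMA(p, q) model is ARIMA with d = 0; the series here is synthetic):

  import numpy as np
  from statsmodels.tsa.arima.model import ARIMA

  rng = np.random.default_rng(0)
  series = rng.normal(size=200).cumsum() * 0.1 + rng.normal(size=200)  # toy series

  model = ARIMA(series, order=(1, 0, 1)).fit()   # AR order 1, d=0, MA order 1
  print(model.forecast(steps=5))                 # predict the next 5 values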

80
Q

What is regression analysis used for?

A

It allows us to quantify the relationship between a particular variable and an outcome we care about while controlling for other factors