Data Science for Dummies Flashcards

1
Q

What are the 3 types of data and examples of each?

A

Structured - data stored, processed, and manipulated in a traditional relational database management system (RDBMS), e.g., a MySQL database that uses tabular data

Unstructured - data generated from human activities that doesn’t fit into a structured database format, e.g., emails, Word docs, AV files

Semistructured - data that doesn’t fit into a structured database system but is organizable by tags that are useful for creating a form of order and hierarchy. Examples include XML files (which store data in the form of hierarchical elements) and JSON files (which store simple data structures and objects in JavaScript Object Notation (JSON), a standard data-interchange format primarily used for transmitting data between a web application and a server; JSON files are lightweight, text-based, human-readable, and editable with a text editor).
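
For illustration, a minimal Python sketch of working with semistructured JSON data (uses only the standard-library json module; the email structure and field names are invented):

  import json

  # Tags/keys impose hierarchy, but there is no fixed relational schema
  raw = '{"email": {"from": "ana@example.com", "tags": ["work", "urgent"]}}'
  record = json.loads(raw)            # parse JSON text into Python objects
  print(record["email"]["tags"][0])   # navigate the hierarchy -> "work"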

2
Q

What is big data?

A

Data that exceeds the processing capacity of conventional database systems because it’s too big or lacks the structural requirements of a traditional database architecture

3
Q

What does it mean to query data?

A

Write commands to extract relevant datasets from data storage systems (usually done with SQL, Structured Query Language)
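
A minimal sketch of querying, using Python’s built-in SQLite driver (the sales table and its columns are invented for illustration):

  import sqlite3

  conn = sqlite3.connect(":memory:")   # throwaway in-memory database
  conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
  conn.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("east", 100.0), ("west", 250.0), ("east", 75.0)])

  # The query: pull only the relevant subset out of the storage system
  for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
      print(row)   # ('east', 175.0) then ('west', 250.0)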

4
Q

What is Hadoop?

A

A platform for batch-processing and storing large volumes of data. It was designed to boil down big data into smaller datasets that are more manageable for data scientists to analyze. Its popularity has been declining since 2015.

5
Q

What are the 3 characteristics that define big data?

A

Volume, velocity, and variety. Because these 3 Vs are ever expanding, newer, more innovative data technologies must be developed to manage big data problems

6
Q

What’s the size of big data volume?

A

As low as 1 terabyte, with no upper limit. If your org owns at least 1 terabyte of data, that data technically qualifies as big data.

7
Q

What does it mean that most big data is “low value?”

A

Big data is composed of a huge number of very small transactions in many formats that have value only once they’re aggregated (by data engineers) and analyzed (by data scientists)

8
Q

Why is data velocity important?

A

A lot of big data is created using automated processes, and a lot of it is low value. You need systems that are able to ingest a lot of it quickly and generate timely and valuable insights

9
Q

What is data velocity?

A

Data velocity is data volume per unit of time. Big data enters an average system at velocities ranging from 30 kilobytes per second to 30 gigabytes per second.

10
Q

What does latency refer to?

A

Related to data velocity. Latency is a system’s delay in moving data after it’s been instructed to do so (every data system has some). Many engineered data systems are required to have latency under 100 milliseconds from data creation to the time the system responds.

11
Q

What is throughput?

A

Related to data velocity. A characteristic describing a system’s capacity for work per unit of time. The capabilities of data handling and processing technologies limit data velocities

12
Q

What are some tools (3) that ingest data into a system (data ingestion)?

A

Apache Sqoop - quickly transfers data back and forth between a relational database system and the Hadoop distributed file system (HDFS)

Apache Kafka
Apache Flume

13
Q

What does data variety refer to?

A

High variety data comes from a multitude of sources with different underlying structures (structured, unstructured, or semistructured)

14
Q

What is a data lake?

A

A nonhierarchical data storage system used to hold huge volumes of multistructured raw data within a flat storage architecture; in other words, a collection of records that come in a uniform format and are not cross-referenced in any way. HDFS can be used as a data lake storage repository, as can the AWS S3 platform (one of the more popular cloud architectures for storing big data).

15
Q

What is a data warehouse?

A

A data warehouse is a centralized data repository that you can use to store and access only structured data. A more traditional warehouse system is a data mart, a storage system for structured data that you can use to store one particular focus area of data belonging to one line of business in the company

16
Q

What is machine learning?

A

The practice of applying algorithms to learn from and make automated predictions from data

17
Q

What is a machine learning engineer?

A

A hybrid between a software engineer and a data scientist (NOT a data engineer): a software engineer skilled enough in data science to deploy advanced data science models within the applications they build, bringing ML models into production in a live environment such as a SaaS product or a webpage

18
Q

What is a data engineer?

A

Data engineers build and maintain data systems for overcoming data processing bottlenecks and data handling problems that arise from the high volume, velocity, and variety of big data. They use software engineering to design systems for, and solve problems with, handling and manipulating big datasets. They often have experience working with and designing real-time processing frameworks and massively parallel processing platforms, as well as RDBMSs.

19
Q

What programming languages do data engineers code in?

A

Java, C++, Scala, or Python. They also know how to deploy Hadoop MapReduce or Spark to handle, process, and refine big datasets into more manageable sizes.

20
Q

What is the purpose of data engineering?

A

Engineer large scale data solutions by building coherent, modular, scalable data processing platforms that data scientists can use to derive insights

21
Q

True or false: Data engineering involves engineering a built system

A

False. It involves the designing, building, and implementing of software solutions to problems in the data world.

22
Q

What’s the difference between a data engineer, ML engineer, and data scientist?

A

The data engineer will store, migrate, and process your data, data scientist will make sense of the data, and ML engineer will bring ML models into production

23
Q

What are the big cloud data storage services?

A

AWS, Google Cloud, Microsoft Azure

24
Q

Why is cloud data storage more beneficial than on-premise data storage?

A

-Cloud service providers take care of the work to configure and maintain computing resources, which makes the data easier to use
-More flexibility: you can turn off cloud services you no longer need, vs. having idle on-premise servers
-More secure

25
Q

What is serverless computing?

A

Computing done in the cloud rather than on a desktop or on-premise servers. A physical server still exists, but it’s supported by the cloud computing company you retain.

26
Q

What is FaaS and some examples?

A

Function as a service. It’s a containerized cloud computing service that makes it easier to execute code in a cloud environment without needing to set up code infrastructure (the data science model runs directly in the container). Examples: AWS Lambda, Google Cloud Functions, Azure Functions

27
Q

What is Kubernetes?

A

An open source software suite that manages and coordinates the deployment, scaling, and management of containerized applications across clusters of worker nodes. It helps software developers build and scale apps quickly, but it needs data engineering expertise to set up quickly.

28
Q

What does it mean for a system to be fault tolerant?

A

Built to continue operating successfully even if one of its subcomponents fails. Has redundancy in computing nodes.

29
Q

What does it mean for a system to be extensible?

A

It can be extended or shrunk in size without disrupting its operations

30
Q

What is parallel processing?

A

Data is processed quickly because the work required to process it is distributed across multiple nodes in a system. This configuration allows for simultaneous processing of tasks across multiple nodes.

31
Q

Name 3 cloud warehouse solutions.

A

Amazon Redshift - big data warehousing service running on data sitting in the cloud

Snowflake - SaaS solution providing parallel processing for structured and semistructured data stored in the cloud on Snowflake’s servers

Google BigQuery

32
Q

True or false: an RDBMS can handle big data

A

False; an RDBMS works only with structured, tabular data that can be queried with SQL

33
Q

What is NoSQL used for?

A

NoSQL is better for big data because it can handle structured, unstructured, and semistructured data.

34
Q

What is a key value pair?

A

A pair of data items: the key is the record identifier, and the value is the data identified by that key
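
A key-value store in miniature, sketched with a plain Python dict (the key format and fields are invented):

  store = {}
  store["user:1001"] = {"name": "Ana", "plan": "pro"}   # key -> value
  print(store["user:1001"]["plan"])                     # look up by key -> "pro"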

35
Q

What are the 4 categories of nonrelational databases NoSQL offers?

A

Graph databases
Document databases (aka document store)
Key-value stores
Column family stores

36
Q

What is a real time processing framework?

A

A framework that processes data in real time as the data streams and flows into the system. The frameworks process data in micro batches and return very quick results.

37
Q

What is in-memory?

A

In-memory refers to processing data within the computer’s memory without reading and writing its computational results to a disk; this is much faster but cannot process much data per processing interval

38
Q

What is machine learning?

A

The practice of applying algorithmic models to data repeatedly so that your computer discovers hidden patterns or trends you can use to make predictions (aka algorithmic learning)

39
Q

What are the 3 steps in machine learning?

A

Setup - acquiring data, preprocessing it, selecting the variables for the task at hand (feature selection), and breaking the data into training/test datasets
Learning - involves model experimentation, training, building, and testing; use the training data to train the model and the test data to test the accuracy of its predictions
Application - model deployment and prediction (see the sketch below)
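
A compact sketch of the three steps with scikit-learn’s bundled iris data (the choice of a k-nearest-neighbors classifier here is arbitrary, just to make the flow concrete):

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.neighbors import KNeighborsClassifier

  # 1. Setup: acquire data and break it into training/test sets
  X, y = load_iris(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

  # 2. Learning: train the model, then test prediction accuracy on held-out data
  model = KNeighborsClassifier().fit(X_train, y_train)
  print("test accuracy:", model.score(X_test, y_test))

  # 3. Application: use the deployed model to predict on new observations
  print("prediction:", model.predict(X_test[:1]))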

40
Q

How should you break data into test and training sets?

A

Apply random sampling to 2/3 of the original dataset in order to train the model, and use the remaining 1/3 for evaluating the model’s predictions.
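
A minimal sketch of that split using scikit-learn (the toy arrays are invented):

  import numpy as np
  from sklearn.model_selection import train_test_split

  X = np.arange(30).reshape(15, 2)   # toy feature matrix (15 instances)
  y = np.arange(15)                  # toy target

  # Randomly sample 2/3 for training; hold out the remaining 1/3 for evaluation
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=1/3, random_state=42)
  print(len(X_train), len(X_test))   # 10 5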

41
Q

What is a random sample?

A

A random sample contains observations that all had an equal probability of being selected from the original dataset

42
Q

In ML, what is an instance?

A

The same as a row in a data table, an observation in stats, and a data point. AKA a case.

43
Q

In ML, what is a feature?

A

The same as a column or field in a data table, a variable in stats. In regression models, a feature is also the independent variable

44
Q

In ML, what is a target variable?

A

The same as a dependent variable in stats

45
Q

In ML, what is feature selection and how is it different from feature engineering?

A

Feature selection is the process of selecting appropriate variables, whereas feature engineering is where you design input variables from an underlying data set (used when your model needs a better representation of the problem being solved than is available in the raw data set)

46
Q

What are supervised learning algos? What are their use cases?

A

Supervised learning algos require that input data has labeled features. These algos learn from known features of the data to produce an output model that successfully predicts labels for new, incoming, unlabeled data points. Use them when you have historical values that are good predictors of future events, e.g., survival analysis (aka event history analysis, designed to predict the timing of an event, such as a mother’s age at first childbirth) or fraud detection.

Logistic regression is an example of an algorithm used in supervised learning.
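
A minimal supervised-learning sketch with scikit-learn (the data is synthetic): labeled features go in, and a model that predicts labels for new points comes out.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # labeled data
  clf = LogisticRegression().fit(X, y)   # learn from the known labels
  print(clf.predict(X[:3]))              # predict labels for incoming points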

47
Q

What are unsupervised algos?

A

Unsupervised learning algos accept unlabeled data and attempt to group observations into categories based on underlying similarities in input features. Examples of this include principal component analysis, k-means clustering, and singular value decomposition. Use cases include recommendation engines, facial recognition, and customer segmentation
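
A minimal unsupervised sketch (synthetic data): no labels are supplied, and k-means groups the observations by similarity alone.

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs

  X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # labels discarded
  labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
  print(labels[:10])   # cluster assignments discovered from the features alone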

48
Q

What is reinforcement learning in ML?

A

A behavior-based learning model: the model is given “rewards” based on how it behaves, and it learns to behave in a way that maximizes rewards.

49
Q

What are descriptive statistics?

A

Provide a description that illuminates some characteristic of a numerical dataset (e.g., distribution, central tendency, dispersion)

50
Q

What’s the difference between standard deviation and variance?

A

Standard deviation is the spread of a group of numbers from the mean, expressed in the same units as the data. Variance measures the average degree to which each point differs from the mean: it is the average of the squared differences between each data point and the mean, and standard deviation is the square root of the variance.
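
A short worked example in numpy to make the two concrete:

  import numpy as np

  data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
  mean = data.mean()                         # 5.0
  variance = ((data - mean) ** 2).mean()     # average squared deviation -> 4.0
  std_dev = np.sqrt(variance)                # square root of variance -> 2.0
  print(variance, std_dev)                   # matches np.var(data), np.std(data)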

51
Q

True or false: Descriptive statistics posit that X causes Y.

A

False; they only highlight relationships between X and Y

52
Q

What are inferential statistics?

A

They carve out a small section of the dataset (a sample) and attempt to deduce significant information about the larger dataset. Inferential stats, such as regression analysis, DO try to predict by studying causation.

53
Q

What are the 3 types of distributions?

A

Normal distribution (numeric continuous) - bell curve

Binomial distribution (numeric discrete) - models the number of successes that can occur in a certain number of attempts when only two outcomes are possible. Binary variables (only two outcomes possible) have a binomial distribution.

Categorical distributions (non-numeric) - for non-numerical categorical variables or ordinal variables (ordered categorical variables, e.g., airline classes)

54
Q

What is Naive Bayes and what is it used for? What are the 3 types?

A

It’s an ML method used to predict the likelihood that an event will occur given evidence defined in your data features (conditional probability). It is used for classification and is especially useful for classifying text data, e.g., a model that predicts whether an email is spam (the event) based on features gathered from content in a repository (the evidence).

3 types:
-MultinomialNB, BernoulliNB, GaussianNB
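
A tiny MultinomialNB text-classification sketch with scikit-learn (the example messages and labels are invented):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.naive_bayes import MultinomialNB

  texts = ["win money now", "cheap money win", "meeting at noon", "project meeting notes"]
  labels = [1, 1, 0, 0]                  # 1 = spam (the event), 0 = not spam

  vec = CountVectorizer()
  X = vec.fit_transform(texts)           # word counts are the evidence/features
  model = MultinomialNB().fit(X, labels)
  print(model.predict(vec.transform(["win cheap money"])))   # -> [1], i.e. spam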

55
Q

What’s the difference between a binomial and multinomial distribution?

A

A multinomial distribution can produce 2+ outcomes; a binomial distribution can produce only 2

56
Q

TRUE OR FALSE: Many ML methods assume features are independent

A

True; you need to test whether they’re independent by evaluating their correlation. A correlation with an r value close to 0 could indicate the variables are independent.

57
Q

What is an a priori assumption?

A

Predictions are based on an assumption that past conditions still hold true

58
Q

What is the Pearson correlation and what assumptions is it based on?

A

A method for measuring the linear relationship between two continuous variables (and determining whether a relationship exists)

In order to use the Pearson correlation, you must have:
1. Normally distributed data
2. Continuous, numeric variables
3. Linearly related variables
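
A minimal Pearson r sketch with SciPy (the toy values are invented):

  import numpy as np
  from scipy.stats import pearsonr

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly linear in x
  r, p_value = pearsonr(x, y)
  print(r)   # close to 1 -> strong positive linear relationship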

59
Q

What is Spearman’s rank correlation?

A

A popular test for determining correlation between ordinal variables.

It assumes:
1. Your variables are ordinal
2. Your variables are related nonlinearly (you can tell by looking at a graph)
3. Your data are nonnormally distributed

Imagine you have two sets of data, like the heights of students and their test scores. Spearman’s rank correlation helps us figure out if there’s a relationship between these two sets of data, but it doesn’t assume that the relationship is a straight line.

Here’s how it works:

  1. Ranking:
    • First, we rank the values in each set. Ranking is like giving each value a position from smallest to largest. If two students have the same height or test score, they get the same rank.
  2. Differences between Ranks:
    • Then, we look at the differences in ranks for each pair of values. For example, if one student is ranked 3rd in height and 5th in test score, the difference is 2 (5 - 3).
  3. Squaring the Differences:
    • We square these differences. Squaring just means multiplying a number by itself. So, if the difference was 2, we square it to get 4 (2 * 2).
  4. Adding Up the Squares:
    • We add up all these squared differences.
  5. Calculating the Correlation:
    • Finally, we use a formula to calculate a number called Spearman’s rank correlation coefficient (let’s call it ρ). With no ties, ρ = 1 - 6Σd² / (n(n² - 1)), where d is each rank difference and n is the number of pairs. This number tells us how related the two sets of data are.
    • If ρ is close to 1, it means there’s a strong relationship: as one set goes up, the other tends to go up too.
    • If ρ is close to -1, it means there’s a strong relationship, but as one set goes up, the other tends to go down.
    • If ρ is close to 0, it means there’s not much of a relationship.

So, Spearman’s rank correlation helps us understand if there’s a connection between two sets of data, without assuming that the connection is a perfect straight line. It’s useful when the data doesn’t follow a normal pattern or when we’re working with ranks or orderings instead of actual numbers.
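
A minimal Spearman’s rho sketch with SciPy (the heights and scores are invented):

  from scipy.stats import spearmanr

  heights = [150, 155, 160, 165, 170, 175]
  scores  = [62, 70, 68, 80, 85, 95]   # mostly increasing, but not a straight line
  rho, p_value = spearmanr(heights, scores)
  print(rho)   # near 1 -> strong monotonic relationship between the rankings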

60
Q

What does it mean to reduce a dataset’s dimensionality?

A

Reduce a dataset’s feature count without losing the important information the dataset contains, by compressing its features’ information into synthetic variables you can subsequently use to make predictions or as input into another ML model

61
Q

What is SVD?

A

Singular value decomposition.
Allows you to reduce the dimensionality of your dataset (the number of features you track when carrying out an analysis). It lets you compress the dataset and remove redundant information and noise.

SVD is applied to analyze principal components from large, noisy, sparse datasets, an ML approach called principal component analysis (PCA).
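
A minimal numpy SVD sketch: factor a matrix, then keep only the top singular values to compress away noise and redundancy (the matrix values are arbitrary random numbers):

  import numpy as np

  A = np.random.default_rng(0).normal(size=(6, 4))
  U, s, Vt = np.linalg.svd(A, full_matrices=False)

  k = 2                                        # keep the 2 strongest components
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]     # compressed, rank-k approximation
  print(np.linalg.norm(A - A_k))               # how much information was dropped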

62
Q

What’s the difference between PCA and SVD?

A

PCA = principal component analysis
SVD = singular value decomposition

PCA assumes you’re working with a square input matrix; if the input matrix is not square, use SVD

63
Q

What does CVE stand for?

A

Cumulative variance explained. The lower the CVE, the more you should take your model’s results with a grain of salt. (Don’t go into the weeds on this; just know roughly what it relates to.)

64
Q

What are the different ways you can reduce dimensionality of your dataset?

A

SVD = singular value decomposition
Factor analysis
PCA = Principal component analysis

65
Q

What are latent variables?

A

Meaningful inferred variables that underlie a dataset but are not directly observable

66
Q

What is factor analysis?

A

The process of fitting a model to prepare a dataset for analysis by reducing its dimensionality and information redundancy. It compresses a dataset’s information into a reduced set of non-information-redundant latent variables.

67
Q

What is PCA?

A

Principal component analysis is related to SVD. It’s an unsupervised method that finds relationships between the features in your dataset, then transforms and reduces them to a set of non-information-redundant principal components - uncorrelated features that explain the information contained within the dataset. The result is a refined representation of the dataset without redundancy, noise, or outliers.
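
A minimal scikit-learn PCA sketch, compressing iris’s 4 correlated features into 2 principal components:

  from sklearn.datasets import load_iris
  from sklearn.decomposition import PCA

  X, _ = load_iris(return_X_y=True)
  pca = PCA(n_components=2)
  X_reduced = pca.fit_transform(X)             # 150 x 4 -> 150 x 2
  print(pca.explained_variance_ratio_.sum())   # share of variance retained (relates to CVE)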

68
Q

What does MCDM stand for?

A

Multiple criteria decision making.

69
Q

What is linear regression?

A

An ML method you can use to describe and quantify the relationship between your target variable y (aka the predictant) and the dataset features you’ve chosen to use as predictor variables (aka dataset X in ML).
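
In standard regression notation (not specific to this book): with one predictor feature x, the fitted model is y = β0 + β1x + ε, where β0 is the intercept, β1 the slope, and ε the prediction error (residual); with p features this extends to y = β0 + β1x1 + … + βpxp + ε.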

70
Q

What are the limitations of linear regression?

A
  1. Works only with numerical variables (not categorical)
  2. If the dataset is missing values, you will have problems
  3. Outliers distort results
  4. Assumes a linear relationship exists
  5. Assumes features are independent
  6. Prediction errors (residuals) should be normally distributed

Needs at least 20 observations per predictive feature for reliable results

71
Q

What is logistic regression?

A

An ML method you can use to estimate values for a categorical target variable based on your selected features. The target variable should be numeric and encode the class/category. In addition to predicting the class of each observation of your target variable, it indicates the probability for each of its estimates.

72
Q

What are the requirements for logistic regression?

A
  1. Does not need to be a linear relationship between features and target variable
  2. Residuals do not have to be normally distributed
  3. Predictive features aren’t required to have a normal distribution

73
Q

What are 4 limitations of logistic regression?

A
  1. Missing values should be treated or removed
  2. Your target variable must be binary or ordinal (1 for yes and 0 for no)
  3. Predictive features should be independent of each other
  4. Needs at least 50 observations per predictive feature to generate reliable results

74
Q

What is a least squares regression line?

A

A method that fits a linear regression line to a dataset. You square the vertical distance values that describe the distances between the data points and the best fit line, add up those squared distances, and adjust the placement of the line so the summed squared distance value is minimized.
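
In symbols: the fitted line ŷ = β0 + β1x is chosen so that the sum of squared vertical distances (the residuals), Σ(yᵢ - ŷᵢ)², is as small as possible.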

75
Q

What is time series?

A

A collection of data on attribute values over time, used to predict future instances of the measure based on past observational data.

76
Q

What are the 5 different patterns in time series?

A

Constant time series
Trended time series
Untrended seasonal time series
Trended seasonal time series
Nonstationary processes - unpredictable behavior not related to seasonality, resulting instead from economic or industry-wide conditions. These can’t be forecast.

77
Q

What’s the difference between multivariate and univariate analysis?

A

Multivariate analysis is the analysis of relationships between multiple variables; univariate analysis is the quantitative analysis of only one variable at a time.

78
Q

What is ARMA?

A

Autoregressive moving average is a class of forecasting methods you can use to predict future values from current and historical data. It combines autoregression techniques (analyses that assume previous observations are good predictors of future values) with moving average techniques, which act like a smoothing tool: they help you see the forest (long-term trends) without being distracted by the trees (short-term fluctuations), making time-varying data easier to interpret and analyze.
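
A hedged ARMA sketch with statsmodels (an ARMA(p, q) model is ARIMA with d = 0; the series here is synthetic):

  import numpy as np
  from statsmodels.tsa.arima.model import ARIMA

  rng = np.random.default_rng(0)
  series = rng.normal(size=200).cumsum() * 0.1 + rng.normal(size=200)  # toy series

  model = ARIMA(series, order=(1, 0, 1)).fit()   # AR order 1, d=0, MA order 1
  print(model.forecast(steps=5))                 # predict the next 5 values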

80
Q

What is regression analysis used for?

A

It allows us to quantify the relationship between a particular variable and an outcome we care about while controlling for other factors