Test 2 Flashcards by Reeda-Marie Chavez

The goal of PCA is to find a high-dimensional representation of the data that maintains as much information as possible.

Machine learning algorithms are readily available in our tools, such as Altair AI Studio, Azure ML Studio, Python libraries, etc. As a result, it is no longer important to understand the mathematical principles and assumptions behind the algorithms.

Linear algebra, calculus, optimization, and probability theory are the four mathematical fields mentioned in the textbook.

A derivative is a measure of the sensitivity of a function to changes in the function’s input(s).

Vectors and matrices are the building blocks in linear algebra.

The transpose of a matrix is the matrix with its rows and columns inverted.

Master data management does not require data governance.

Data classification is a way to define the various levels of confidentiality/security required by the organization.

Symmetric encryption is a good way for a bank to authenticate your account credentials.

Information security involves protecting data from unauthorized access.

Master data is the contextual data about the organization/entity used to increase the informativeness of transaction data.

Modeling in data science involves creating representations of real-world phenomena.

In information security, defense in depth is the concept that an organization uses multiple layers of security to protect sensitive/valuable data assets.

Multiple linear regression improves the power of your analysis by quantifying the cumulative effect of all features.

While doing our bivariate analysis, we found that the following attributes had the following r-square values with respect to the label attribute:
age: 0.29
education: 0.14
experience: 0.16
Given this information we should expect a regression model that includes these same three attributes against the same label will have an R-square value of 0.59.

R² represents how well the regression model explains the variance in the label value.

Multiple linear regression (MLR) is but one of many algorithms used for multivariate modeling.

Scaling techniques such as LogNormal, MinMax, Z-score, Tanh, and Logistic are used to adjust the values of numeric variables.

K-means clustering is more robust to outliers than k-medoids clustering.

Cluster analysis is a form of supervised learning.

Lower values of the Calinski-Harabasz criterion indicate a better clustering solution.

K-medoids clustering identifies an actual data point for each cluster that is most centrally located.

Clustering requires us to specify a label attribute.

A 2-nearest neighbor model is more likely to overfit than a 20-nearest neighbor model.

Structured data includes text or image data.

Machine learning models can be and often are used without concern for interpretation.

The coefficient of determination (R-squared) indicates how well the predicted values explain the variation in the observed values.

Multiple linear regression differs from linear regression because it is generally used to predict categorical values.

The primary purpose of statistics is to make precise predictions about new observations.

The Mean Absolute Error (MAE) is less sensitive to outliers than the RMSE.

The F-statistic is used to determine if the overall regression model is a good fit for the data.

Logistic regression requires numeric predictor attributes, so categorical attributes need to be converted to numeric attributes before analysis.

The Root Mean Squared Error (RMSE) is expressed in the same units as the dependent variable.

Clustering is a supervised learning method.

The adjusted coefficient of determination penalizes for the number of independent variables in the model.

Quantitative variables can be either discrete or continuous.

Machine learning involves implementing static algorithms to make predictions.

Bias refers to the error due to sensitivity to small fluctuations in the training set.

Deep learning is a subset of machine learning that uses neural networks with many layers.

Decision nodes are used in linear regression.

The error rate of a classifier is equal to the number of incorrect decisions made over the total number of decisions made.

Underfitting occurs when the model does not adequately reflect the distribution of the training data.

When using clustering, the target variable does not have to be precisely defined at training time.

In the use phase, k-means algorithms classify new instances by finding the k most similar training instances and applying a combination function to the known values of their target variables.

Random Forests are relatively robust against worthless features.

Convolutional neural networks (CNNs) are an example of algorithmic generation of features.

Weak learners in ensemble methods may perform only slightly better than a random decision.

The Data Preparation phase in CRISP-DM includes selecting data and cleaning data.

Fliers in a Tukey box plot represent data values beyond the cap values.

Derived attributes are new attributes constructed from one or more existing attributes.

The empirical rule states that any data within three standard deviations of the mean is considered an outlier.

Data cleaning and transformation are one-time tasks and do not require iteration in a data science project.

Imputing the null value is usually a better option than using some type of average.

The Tukey box plot is a visual representation of the skewness of a distribution.

Outliers should always be removed from data sets.

The median is generally represented as a line within the quartile box in a Tukey box plot.

The caps in a Tukey box plot can only represent the true min/max values.

The empirical rule is relevant for categorical or binary data.

Missing data can be classified as Missing Completely at Random (MCAR).

The architecture of the original Watson AI was simple and relied on a single machine learning model.

Symbolic AI, also known as GOFAI, relies on machine learning algorithms.

Autonomous vehicles are considered the smartest robots built by mankind so far.

According to the author of the Papp et al. chapter on AI, machine learning can be used for purposes other than AI.

Early stage AI work focused exclusively on machine learning algorithms.

The first programming language introduced to help computers achieve symbolic intelligence was Python.

The architecture of AI solutions tends to become simpler over time.

One of the biggest problems in creating a good artificial generalized intelligence model is figuring out how to balance the need to be able to learn quickly from a small number of examples and the need to develop capabilities that apply generally.

The Universal Approximation Theorem states that an artificial neural network with one hidden layer can approximate any mathematical function.

When modeling the crypto market, the authors were able to borrow a model for traditional equities markets and use it as is with excellent results.

Inductive biases help machine learning algorithms learn faster and more efficiently.

Model descriptions include a detailed description of the model and any special features.

The data mining engineer ranks the models according to evaluation criteria.

AI is just a fancy name for machine learning models.

All modeling techniques make the same assumptions about the data.

One disadvantage of using AI Studio's automated feature selection operators is that the selection process is based on feature correlations and does not account for the chosen modeling algorithm.

When creating indicator features for a categorical attribute, you should create an indicator feature for each potential category value.

One approach during the test design phase involves separating the data set into training and test sets.

Parameter settings are adjusted during the data preparation phase.

Classification models predict a numeric label value.

The lifecycle of a modelling and simulation project is linear and straightforward.

Black-Box models provide more insight into actual dynamics than White-Box models.

Verification answers the question, "Is the model developed right?"

The iterative process of modelling is depicted in the following figure: Problem Formulation --> Modeling Concept --> Usefulness --> Validation --> Answer to the Problem - Validation pointing to Modeling Concept and Problem Formulation with "No" - Usefulness pointing to Modeling Concept and Problem Formulation with "No"

Calibration is used when parameter values are unknown and need to be estimated.

According to the textbook, ODE and SD are two of the most common microscopic modeling techniques.

According to the textbook readings, modeling methods can be classified into two primary categories: macroscopic methods and microscopic methods.

The term "dynamic" in modelling refers to time-dependent components.

Partial differential equations (PDEs) are used to describe systems behavior over time.

Documentation is crucial for the reproducibility of simulation models.

Validation answers the question, "Is the right model developed?"

Visualization is not necessary for documenting and validating simulation models.

What is the purpose of data cleaning in a data preparation phase?

To identify and correct errors or inconsistencies in the dataset.

Define feature engineering.

The process of creating new features from existing data to improve model performance.

How do you handle missing values in a dataset?

Techniques include imputation (filling in values) or deletion (removing missing data).
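
A minimal sketch of the two strategies named on this card, assuming missing entries are represented as `None`. The function names `impute` and `drop_missing` and the sample ages are hypothetical, chosen only for illustration.

```python
import statistics

def impute(values, strategy="median"):
    """Fill each missing entry (None) with the median or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def drop_missing(values):
    """Deletion: keep only the observed entries."""
    return [v for v in values if v is not None]

ages = [25, None, 31, 40, None, 100]  # hypothetical sample with two gaps
print(impute(ages, "median"))  # gaps filled with 35.5
print(impute(ages, "mean"))    # gaps filled with 49
print(drop_missing(ages))      # [25, 31, 40, 100]
```

With the value 100 in the sample, the mean fill (49) is pulled upward while the median fill (35.5) is not, which is one reason the median is often preferred for skewed attributes.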

What types of visualizations are commonly used in EDA?

Histograms, box plots, and scatter plots to uncover patterns in data.

Why is correlation analysis important?

It helps examine relationships between variables, indicating potential dependencies.

What are outliers, and why should they be identified?

Outliers are data points that differ significantly from others; they can skew analysis results.

What is the difference between supervised and unsupervised learning?

Supervised learning uses labeled data; unsupervised learning uses unlabeled data.

Name a common metric used for evaluating model performance.

Accuracy, precision, recall, F1-score, or ROC-AUC.
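
As a rough illustration, precision, recall, and F1-score can be computed from confusion-matrix counts. The function name and the counts below are hypothetical, and the sketch assumes a binary label with at least one predicted positive and one actual positive.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)   # of everything predicted positive, how much was right
    recall = tp / (tp + fn)      # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(precision)          # 0.8
print(round(recall, 3))   # 0.667
print(round(f1, 3))       # 0.727
```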

What is overfitting, and how can it be prevented?

Overfitting occurs when a model learns noise instead of the underlying pattern; it can be prevented by using regularization techniques and cross-validation.

What are ensemble methods in machine learning?

Techniques that combine multiple models to improve overall accuracy (e.g., Random Forest, Boosting).

What is the purpose of dimensionality reduction?

To simplify models while retaining essential information, often using techniques like PCA.

How is Natural Language Processing (NLP) used in data science?

NLP techniques are used to process and analyze textual data.

What is the significance of model monitoring?

To evaluate models against real-world data and ensure continued accuracy.

Why is documentation important in model deployment?

It maintains clear records of model specifications, performance metrics, and changes made.

How can models be updated post-deployment?

By retraining models with new data to adapt to changing conditions.

What are the potential impacts of bias in data science?

Bias can lead to unfair or inaccurate model predictions affecting decisions and outcomes.

What measures can be taken to ensure data privacy?

Implementing security measures and complying with regulations like GDPR.

Why is transparency important in machine learning models?

It promotes user trust by clarifying algorithms and decision-making processes.

Supervised Learning

You know the independent (input) and dependent (output) variables.

Unsupervised Learning

You do NOT know the labels for output variables. The model identifies patterns or groupings on its own.

CRISP-DM

Cross-Industry Standard Process for Data Mining. Steps: Data Understanding: Explore initial data, find quality issues, and insights. Data Prep: Clean, format, and transform data for modeling. Modeling: Select and apply appropriate algorithms. Know each stage and tasks involved.

Regression Evaluation Metrics

Common metrics include: P-Value: Measures variable significance. R-Squared: Proportion of variance explained by the model. Adjusted R-Squared: Adjusts R-Squared based on the number of predictors. Understand each metric's role in model evaluation.
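
A small sketch of how R-squared and adjusted R-squared relate, using the standard formulas R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of observations and k the number of predictors. The sample values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in the actual values explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean) ** 2 for a in actual)                  # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

actual = [3, 5, 7, 9, 11]                # hypothetical observed labels
predicted = [2.8, 5.1, 7.2, 8.9, 11.0]   # hypothetical model output
r2 = r_squared(actual, predicted)
print(round(r2, 4))                                # 0.9975
print(round(adjusted_r_squared(r2, n=5, k=1), 4))  # 0.9967
```

Adding predictors can never lower plain R-squared, so the adjusted version is the safer one to compare across models with different numbers of predictors.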

Assumptions of MLR

Key assumptions include: Linearity, independence, homoscedasticity, and normality of residuals. Review each assumption for accurate model performance.

Distance

Measures similarity or dissimilarity between data points, used in clustering and k-NN.
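
Two common distance measures, sketched for points given as equal-length tuples of numbers. Euclidean and Manhattan distance are picked here as examples; the card itself does not name a specific measure.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Because both measures add up raw differences per attribute, attributes on larger scales dominate unless the data is standardized or normalized first.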

Probability

Likelihood of a particular outcome; foundational in predictive modeling and statistics.

Bayes' Theorem

Method to calculate conditional probability. Know the basic formula and its application.
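
The basic formula is P(H | E) = P(E | H) · P(H) / P(E), where P(E) = P(E | H) · P(H) + P(E | not H) · P(not H). A minimal sketch follows; the function name and the screening-test numbers are hypothetical.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) from the prior P(H) and the two conditional probabilities of the evidence."""
    # Total probability of seeing the evidence, whether or not H is true
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence

# Hypothetical screening test: 1% base rate, 90% sensitivity, 5% false-positive rate
p = posterior(prior=0.01, p_evidence_given_h=0.9, p_evidence_given_not_h=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the low base rate keeps the posterior near 15%, which is the classic application of the formula.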

Naïve Bayes

A simplified application of Bayes' theorem that assumes the features are independent of one another given the class. Used for classification tasks.

Entropy

Measurement of data disorder, specifically for labels. Interpretation: How mixed positive and negative groups are in the dataset.
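
A short sketch of Shannon entropy over a list of labels, using log base 2 so that a perfectly mixed two-class set scores 1 bit. The label values are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    # Each class contributes -p * log2(p), where p is its share of the labels
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # 1.0 (evenly mixed)
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 (pure)
```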

Information Gain

Reduction in entropy after splitting data; indicates improvement in data classification.
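
Information gain can be sketched as the parent's entropy minus the size-weighted entropy of the child groups produced by a split. The helper names and sample labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child groups."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]                                     # each child is pure
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]      # children as mixed as the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A decision tree learner would evaluate candidate splits this way and prefer the one with the highest gain.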

Data Prep Basics

Preprocessing steps: Convert data to correct type. Set target labels. Create indicator and ordinal variables. Address skewness. Standardize or normalize data (e.g., Min-Max, Z-score).

Data Leaks

Using data that wouldn't be available at prediction time, leading to overly optimistic models.

Accuracy

Ratio of correct predictions over total predictions. Formula: (True Positives + True Negatives) / Total Predictions.
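
The formula above translates directly into code. The counts used here are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """(True Positives + True Negatives) / Total Predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined with zero predictions")
    return (tp + tn) / total

print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```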

Missing Values and how to handle

Types: Missing Completely at Random (MCAR): No pattern to missingness. Missing Not at Random (MNAR): Missingness related to unobserved data. Know strategies for each type.

Classification evaluation metrics

- accuracy - AUC - F-Score

confusion matrix

A cross-tabulation of predicted classes against actual classes that shows how many predictions for each class were right and wrong. It should be sized n x n, because the label is not always binomial.
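
A minimal sketch of building an n x n confusion matrix for a three-class label, with rows as actual classes and columns as predicted classes (one common layout; some tools transpose it). The animal labels are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return the sorted class list and an n x n matrix: rows = actual, columns = predicted."""
    classes = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))  # missing pairs count as 0
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix

actual = ["cat", "dog", "bird", "cat", "dog", "bird"]
predicted = ["cat", "dog", "cat", "cat", "bird", "bird"]
classes, matrix = confusion_matrix(actual, predicted)
print(classes)  # ['bird', 'cat', 'dog']
for row in matrix:
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]
```

The diagonal holds the correct predictions (4 of 6 here); every off-diagonal cell is a specific kind of mistake.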

clustering

Grouping data points into clusters based on similarity; an unsupervised learning technique.

k-NN

Classification or regression based on the 'k' closest points (neighbors) to a target. k: the number of neighbors to use (one or more). NN: nearest neighbors.
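
A bare-bones k-NN classifier, assuming Euclidean distance and a simple majority vote among the k neighbors (other distance measures and weighted votes are also common). The training points and the function name are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_predict(train, (2, 2), k=3))  # A
print(knn_predict(train, (7, 7), k=3))  # B
```

Note that there is no training step beyond storing the data: all the work happens when a new instance is classified.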

K-means

Partitioning data into 'k' clusters with each point assigned to the nearest cluster centroid.
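
A compact sketch of the k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. For repeatability this version takes the starting centroids as an argument, which is an assumption of the sketch; real implementations usually pick them randomly or with a seeding heuristic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Run a fixed number of assign/update rounds and return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its points (unchanged if empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # hypothetical data
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (8, 9), (9, 8)]
```

The resulting centroids are means of the points, so they generally do not coincide with any actual data point.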

Decision Trees

Tree-structured classifiers that split data based on feature values to make predictions.

Correcting for skew/kurtosis

Log transformation, square root, or Box-Cox to normalize data distributions.
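
To illustrate, the sketch below measures skewness before and after a log transformation. It uses the population moment formula for skewness (third central moment divided by the 1.5 power of the second), and the income values are invented. A log transformation only works on strictly positive values.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]  # hypothetical right-skewed data
logged = [math.log(v) for v in incomes]               # requires all values > 0

print(skewness(incomes) > skewness(logged))  # True: the log transform reduces the right skew
```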

standardizing/normalizing data values

Adjust data to a common scale. Standardizing: Center data around 0 (Z-score). Normalizing: Scale data between a set range (e.g., 0-1, Min-Max).
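
Both adjustments can be sketched in a few lines. The Z-score version here divides by the population standard deviation (some tools use the sample standard deviation instead), and both functions assume the values are not all identical. The data is illustrative.

```python
import statistics

def z_scores(values):
    """Standardize: subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumes sd > 0
    return [(v - mean) / sd for v in values]

def min_max(values, low=0.0, high=1.0):
    """Normalize: rescale linearly so the minimum maps to `low` and the maximum to `high`."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
print(z_scores(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
scaled = min_max(data)
print(scaled[0], scaled[-1])  # 0.0 1.0
```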

feature selection

Choosing which features help the model and eliminating the ones that hurt it.

inductive logic

Inferring general patterns from specific observations (bottom-up approach)

deductive logic

Starts with general rules or premises to reach specific conclusions (top-down approach).

(136 cards)