Definitions Flashcards

Question

Time series data

Answer 1

A sequence of values recorded and collected over specific intervals of time.

Answer 2

When a machine learning model learns the details and noise in the training data to the extent that it negatively impacts its performance on new, unseen data.

Answer 3

When a machine learning model is too simple to capture the underlying patterns in the training data, leading to poor performance on both the training and testing data.

Answer 4

A technique used to make a model simpler and reduce the risk of overfitting by adding a penalty to the model's complexity.
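As a rough illustration of "adding a penalty to the model's complexity", the sketch below adds an L2 (ridge-style) penalty to a mean squared error. The function names and the `lam` parameter are hypothetical, chosen only for this example; L2 is one common choice of penalty, not the only one.

```python
def mse(weights, bias, xs, ys):
    """Mean squared error of a simple linear model."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, xi in zip(weights, x)) + bias
        total += (pred - y) ** 2
    return total / len(xs)


def ridge_loss(weights, bias, xs, ys, lam):
    """MSE plus an L2 penalty; lam (hypothetical name) sets the penalty strength."""
    penalty = lam * sum(w * w for w in weights)
    return mse(weights, bias, xs, ys) + penalty


# Toy data that the weights below fit exactly, so the MSE part is zero.
xs = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
ys = [5.0, 2.0, 5.0]
weights, bias = [1.0, 2.0], 0.0

plain = ridge_loss(weights, bias, xs, ys, lam=0.0)      # no penalty
penalized = ridge_loss(weights, bias, xs, ys, lam=0.1)  # larger weights cost more
```

With the penalty switched on, the loss grows with the size of the weights, which is what pushes training toward simpler models.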

Answer 5

Datasets in which the number of positive instances is not approximately the same as the number of negative instances.

Answer 6

One of the most common techniques proposed for dealing with imbalanced datasets. It aims to modify the original dataset to make it more balanced.

Answer 7

A technique used to increase the number of observations in the minority class to achieve balance, either by duplicating existing instances or by generating new ones using various techniques.
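The duplication variant can be sketched in a few lines. This is a minimal example assuming a binary problem with list-based data; `random_oversample` and its parameters are hypothetical names, not a library API.

```python
import random


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority = [r for r, l in zip(rows, labels) if l == minority_label]
    majority_count = sum(1 for l in labels if l != minority_label)
    extra_needed = max(0, majority_count - len(minority))

    new_rows, new_labels = list(rows), list(labels)
    for _ in range(extra_needed):
        new_rows.append(rng.choice(minority))  # an exact copy, no new information
        new_labels.append(minority_label)
    return new_rows, new_labels


rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 0, 1, 1]
new_rows, new_labels = random_oversample(rows, labels, minority_label=1)
```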

Answer 8

A technique used to balance an imbalanced dataset by reducing the number of observations in the majority class, either by removing instances or selecting a subset of the majority class using various techniques.
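A minimal sketch of the "removing instances" variant, assuming a binary problem and list-based data; `random_undersample` is a hypothetical helper written for this card.

```python
import random


def random_undersample(rows, labels, majority_label, seed=0):
    """Keep all minority rows and a random, equally sized subset of majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    minority_idx = [i for i, l in enumerate(labels) if l != majority_label]

    n_keep = min(len(majority_idx), len(minority_idx))
    keep = sorted(rng.sample(majority_idx, n_keep) + minority_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]


rows = [[1], [2], [3], [4], [5], [6]]
labels = [0, 0, 0, 0, 1, 1]
small_rows, small_labels = random_undersample(rows, labels, majority_label=0)
```

The trade-off is that discarded majority rows may have carried useful information.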

Answer 9

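(Synthetic Minority Oversampling Technique) A technique that increases the number of instances in the minority class to balance the dataset, usually by generating synthetic instances interpolated between existing minority instances and their nearest minority-class neighbors, rather than by simply duplicating instances.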
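SMOTE is usually described as building each synthetic point on the line segment between a minority instance and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step, assuming numeric features and Euclidean distance; `smote_sketch` and its parameters are hypothetical, and practical work would normally use a library implementation.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_sketch(minority, n_new=5)
```

Every synthetic point lies between two real minority points, so it stays inside the region the minority class already occupies.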

Answer 10

A natural phenomenon in a dataset. They are observations that differ significantly from the rest of the data, often appearing at the extremes.
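One common rule of thumb for flagging such observations is the 1.5 × IQR rule. The sketch below assumes that rule; the card itself does not prescribe a detection method, and `iqr_outliers` is a hypothetical helper.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 11, 10, 95]
flagged = iqr_outliers(data)
```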

Answer 11

Features in a dataset that do not have any meaningful relationship with the target variable and do not contribute to improving the model's performance.

Answer 12

The process of selecting, modifying, or creating new features from raw data to improve the performance of a machine learning model.

Answer 13

The process of choosing the most important features from a dataset, removing irrelevant or redundant ones, to improve model performance and reduce complexity.

Answer 14

The process of assessing the importance of features before training the model, often using techniques like correlation analysis, to determine which features contribute most to predicting the target variable.

Answer 15

The process of evaluating the importance of features after the model has been trained, using methods like model coefficients or feature importance scores, to understand which features most influence the model's predictions.

Answer 16

Combining existing features to produce more useful ones (for example, using data mining or dimensionality reduction algorithms).

Answer 17

A machine learning model used in supervised learning, whether for classification or regression, that splits data into branches based on feature values to make predictions or decisions.

Answer 18

A regression approach that finds the optimal decision boundary to separate different classes.

Answer 19

A supervised learning model used for both classification and regression. It builds multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting.

Answer 20

A machine learning method that tries to mimic human thinking and cognition.

Answer 21

A machine learning method that assigns a class to a data point based on the classes of its closest neighbors.
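A minimal sketch of neighbor-based classification, assuming Euclidean distance and a simple majority vote; `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]]
labels = ["a", "a", "a", "b", "b", "b"]

near_a = knn_predict(points, labels, [1.2, 1.1])
near_b = knn_predict(points, labels, [6.4, 6.4])
```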

Answer 22

A post-training feature importance technique where the values of one feature at a time are randomly shuffled, and the resulting change in model performance is measured to determine the importance of each feature.
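The shuffling idea can be sketched without any library. The example assumes accuracy as the performance metric and uses a stand-in "trained" model that looks only at the first feature; all names here are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


# Stand-in for a trained model: it uses feature 0 and ignores feature 1.
model = lambda row: int(row[0] > 0.5)
rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
labels = [0, 1, 0, 1, 0, 1]

importances = permutation_importance(model, rows, labels, n_features=2)
```

Shuffling the ignored feature cannot change the predictions, so its importance is exactly zero, while shuffling the used feature can only keep or lower the accuracy.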

Answer 23

A Python library used for data manipulation, cleaning, and analysis. It allows reading from and writing to various file formats such as CSV, Excel, SQL databases, and JSON.

Answer 24

A function in Pandas that provides a summary of a DataFrame, including the number of non-null entries, data types of each column, and memory usage, helping to understand the structure of the dataset.

Answer 25

A function in pandas that displays the first few rows of a DataFrame, helping you get a quick overview of the dataset. By default, it shows the first 5 rows.

Answer 26

A function in Pandas that generates summary statistics for numerical columns in a DataFrame, including count, mean, standard deviation, min, max, and quartiles, providing a quick statistical overview of the data.

Answer 27

A function in Pandas that returns the count of unique values in a column, sorted in descending order, providing insights into the distribution of categorical data.
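The four summary functions described in the cards above correspond, as far as can be told from the definitions, to `DataFrame.info`, `DataFrame.head`, `DataFrame.describe`, and `Series.value_counts`. A short sketch on a made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Cairo", "Oslo", "Lima"],
    "temp_c": [4.0, 19.5, 6.0, 30.0, 5.0, 21.0],
})

df.info()                           # prints dtypes, non-null counts, memory usage
first_rows = df.head()              # first 5 rows by default
stats = df.describe()               # count, mean, std, min, quartiles, max (numeric columns)
counts = df["city"].value_counts()  # unique values with counts, most frequent first
```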

Answer 28

A function in Pandas used to create various types of plots (line, bar, histogram, etc.) directly from a DataFrame or Series, helping visualize the data.

Answer 29

Representing text as numerical vectors to make it usable for machine learning models.

Answer 30

A method of converting categorical data into numerical vectors, where each category is represented by a vector with a single active element set to 1 and all other elements set to 0.
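A minimal sketch of this encoding in plain Python, assuming the categories are ordered alphabetically; `one_hot_encode` is a hypothetical helper, not a library function.

```python
def one_hot_encode(values):
    """Map each value to a vector with a single 1 at its category's position."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors


categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
# categories → ['blue', 'green', 'red']
# vectors    → [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```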

Answer 31

A method for representing text where each word is treated as a separate feature, and the frequency of each word in the document is counted, ignoring grammar and word order.
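A minimal sketch of this representation, assuming lowercase whitespace tokenization (real pipelines usually tokenize more carefully); `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(documents):
    """Return the vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[w] for w in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
# vocabulary → ['cat', 'dog', 'sat', 'saw', 'the']
# vectors    → [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Word order is lost: "the dog saw the cat" would produce the same vector as the second document.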

Answer 32

A measure of how often a word appears in a document, usually calculated as the number of times the word appears divided by the total number of words in the document.
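The ratio described here can be computed directly. A small sketch assuming lowercase whitespace tokenization; `term_frequency` is a hypothetical helper.

```python
def term_frequency(word, document):
    """Occurrences of word divided by the total number of words in document."""
    tokens = document.lower().split()
    return tokens.count(word) / len(tokens)


tf_the = term_frequency("the", "the cat saw the dog")  # 2 of 5 words → 0.4
tf_cat = term_frequency("cat", "the cat saw the dog")  # 1 of 5 words → 0.2
```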

Answer 33

A measure used to evaluate how important a word is within the entire corpus. It is calculated by taking the logarithm of the total number of documents divided by the number of documents containing the word.
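The calculation in this card, written out as a sketch. It assumes the natural logarithm and no smoothing, so a word that appears in no document is rejected rather than guessed at; libraries often use smoothed variants. `inverse_document_frequency` is a hypothetical helper.

```python
import math


def inverse_document_frequency(word, documents):
    """log(total documents / documents containing word), natural log, unsmoothed."""
    containing = sum(1 for doc in documents if word in doc.lower().split())
    if containing == 0:
        raise ValueError("word does not occur in any document")
    return math.log(len(documents) / containing)


corpus = ["the cat sat", "the dog ran", "the cat ran fast"]
idf_the = inverse_document_frequency("the", corpus)  # in every document → 0.0
idf_dog = inverse_document_frequency("dog", corpus)  # in 1 of 3 documents → log(3)
```

A word that appears everywhere scores zero, which is why very common words contribute little once term frequency is weighted by this measure.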

Answer 34

A point in the decision tree where a condition is evaluated. It represents a feature in the dataset that is used to split the data.

Answer 35

The final output node that contains the predicted value or class label. It represents the conclusion of the decision process.

Answer 36

A decision plane that separates sets of objects having different class memberships.

Answer 37

The data points that are closest to the hyperplane. These points define the separating line better by calculating margins.

Answer 38

The distance between the closest data points (support vectors) and the decision boundary.
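For a linear boundary written as w · x + b = 0, the distance from a point x to the boundary is |w · x + b| / ‖w‖, and the margin is the smallest such distance over the data. A small sketch with a hand-picked boundary (the values of `w`, `b`, and the points are made up for illustration):

```python
import math


def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from point to the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


def margin(points, w, b):
    """Distance from the boundary to the closest point(s)."""
    return min(distance_to_hyperplane(p, w, b) for p in points)


w, b = [3.0, 4.0], -12.0
points = [[4.0, 5.0], [0.0, 0.5], [6.0, 1.0]]

distances = [distance_to_hyperplane(p, w, b) for p in points]  # → [4.0, 2.0, 2.0]
smallest = margin(points, w, b)                                # → 2.0
```

The two points at distance 2.0 are the ones closest to the boundary, so they play the role of the closest data points in this toy setup.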

(62 cards)