Definitions Flashcards

(62 cards)

1
Q

Machine learning

A

The field of study that gives computers the ability to learn without being explicitly programmed.

2
Q

Supervised Learning

A

A type of machine learning where the model learns from labeled data — each input has a known correct output — to make predictions on new data.

3
Q

Unsupervised Learning

A

A type of machine learning where the model learns from unlabeled data, finding hidden patterns or groupings without knowing the correct output in advance.

4
Q

Semi-Supervised Learning

A

A type of machine learning that uses a small amount of labeled data and a large amount of unlabeled data to improve learning accuracy.

5
Q

Reinforcement learning

A

A type of machine learning in which an agent learns to make decisions by interacting with an environment and receiving feedback through rewards or penalties.

6
Q

Classification

A

A machine learning task where the model learns from labeled data to assign new inputs to predefined categories.

7
Q

Regression

A

A machine learning task where the model learns from labeled data to predict a continuous numerical value for new inputs.

8
Q

Clustering

A

A machine learning task where the model groups inputs so that similar inputs are placed together, based on patterns in the data.

9
Q

Anomaly detection

A

Finding irregular data that doesn’t match regular patterns.

10
Q

Association Rule Learning

A

Finding patterns or rules that show how items are related to each other in data.

11
Q

Batch Learning

A

A learning method where the model is trained on the entire dataset at once, usually offline, and updated only when retrained with new data.

12
Q

Mini-Batch (Online) Learning

A

A method where the model is trained on small batches of data, allowing it to learn and update continuously as new data arrives.

13
Q

Dataset

A

A collection of data that is treated as a single unit by a computer.
- Oxford Dictionary.

A collection of data used for some specific machine learning purpose.
- Encyclopedia of Machine Learning (Kakas, 2010).

14
Q

Labels

A

The output variables (the target value) the model aims to predict

15
Q

Features

A

The input variables used by the model to make predictions

16
Q

Training data

A

The dataset used to train a machine learning model, containing both features and labels.

17
Q

Testing data

A

The dataset used to evaluate the performance of a trained machine learning model.

18
Q

Validation data

A

A dataset, kept separate from the training and testing data, used to tune the model’s hyperparameters and to detect overfitting during training.
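
As a rough sketch of how one dataset might be divided into the training, validation, and testing data defined on these cards, the following uses only the Python standard library. The function name `split_dataset` and the 70/15/15 proportions are assumptions chosen for illustration; they do not come from the cards.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains is held out for testing
    return train, val, test

rows = list(range(100))                     # hypothetical dataset of 100 instances
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

The model is fitted on `train`, tuned against `val`, and evaluated once on `test`, so the test score is not influenced by tuning decisions.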

19
Q

CSV format

A

A file format used to store data in a tabular form, where each row represents an instance and each column is a feature.

20
Q

Textual data

A

Data represented in the form of text, including words, sentences, or documents. It needs to be converted into numbers when working with machine learning models.

21
Q

Numerical data

A

Data that consists of numbers, used for mathematical calculations and statistical analysis.

22
Q

Bivariate datasets

A

Datasets with two variables, used to analyze the relationship between them.

23
Q

Multivariate datasets

A

Datasets with more than two variables, used to analyze relationships among multiple variables.

24
Q

Correlation Datasets

A

Datasets used to measure the relationship between two or more variables, where the variables must be numerical to calculate correlation.
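
The card does not name a specific correlation measure. As one common choice, the sketch below computes the Pearson correlation coefficient between two numerical variables; the function name and the sample values (`hours`, `scores`) are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
scores = [52, 55, 61, 70, 72]      # hypothetical exam scores
r = pearson_correlation(hours, scores)
print(r)
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.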

25
Time series data
A sequence of values recorded and collected over specific intervals of time.
26
Overfitting
When a machine learning model learns the details and noise in the training data to the extent that it negatively impacts its performance on new, unseen data.
27
Underfitting
When a machine learning model is too simple to capture the underlying patterns in the training data, leading to poor performance on both the training and testing data.
28
Regularization
A technique used to make a model simpler and reduce the risk of overfitting by adding a penalty to the model's complexity.
29
Imbalanced dataset
A dataset in which the number of positive-class instances is not approximately the same as the number of negative-class instances.
30
Resampling
One of the most common techniques proposed for handling imbalanced datasets. It modifies the original dataset to make it more balanced.
31
Oversampling
A technique used to balance an imbalanced dataset by increasing the number of observations in the minority class, either by duplicating instances or by generating new ones using various techniques.
32
Undersampling
A technique used to balance an imbalanced dataset by reducing the number of observations in the majority class, either by removing instances or selecting a subset of the majority class using various techniques.
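Oversampling and undersampling can each be sketched in their simplest, random form as follows. This is a minimal illustration using only the standard library; the function names and the toy data are assumptions, and real projects would typically rely on a dedicated library instead.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))   # a duplicate, no new information
            out_labels.append(cls)
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels

rows = [[i] for i in range(10)]     # hypothetical instances with one feature each
labels = [1, 1] + [0] * 8           # 2 minority instances, 8 majority instances
over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Oversampling keeps all the original data at the cost of repeated rows, while undersampling discards majority-class rows and so may lose information.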
33
SMOTE
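(Synthetic Minority Oversampling Technique) An oversampling technique that balances the dataset by increasing the number of instances in the minority class. Rather than duplicating existing instances, which adds no new information, it generates synthetic instances by interpolating between a minority instance and its nearest minority-class neighbors.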
34
Outliers
A natural phenomenon in a dataset. They are observations that differ significantly from the rest of the data, often appearing at the extremes.
35
Irrelevant features
Features in a dataset that do not have any meaningful relationship with the target variable and do not contribute to improving the model's performance.
36
Feature Engineering
The process of selecting, modifying, or creating new features from raw data to improve the performance of a machine learning model.
37
Feature Selection
The process of choosing the most important features from a dataset, removing irrelevant or redundant ones, to improve model performance and reduce complexity.
38
Pre-training Feature Importance
The process of assessing the importance of features before training the model, often using techniques like correlation analysis, to determine which features contribute most to predicting the target variable.
39
Post-training Feature Importance
The process of evaluating the importance of features after the model has been trained, using methods like model coefficients or feature importance scores, to understand which features most influence the model's predictions.
40
Feature extraction
Combining existing features to produce more useful ones, for example with data mining or dimensionality reduction algorithms.
41
Decision tree
A machine learning model used in supervised learning, whether for classification or regression, that splits data into branches based on feature values to make predictions or decisions.
42
Support vector machine (SVM)
A supervised learning approach that finds the optimal decision boundary (hyperplane) to separate different classes.
43
Random Forest
A supervised learning model used for both classification and regression. It builds multiple decision trees during training and combines their predictions to improve accuracy and reduce overfitting.
44
Neural networks approach
A machine learning method that tries to mimic human thinking and cognition, using layers of interconnected nodes (artificial neurons) to learn patterns from data.
45
K-nearest neighbor (KNN)
A machine learning method that assigns a class to a new input based on the classes of its closest neighbors in the training data.
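A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the `k` nearest training points (the card does not specify either choice). The function name and toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair every training point's distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

There is no training step here: the method simply stores the labeled points and does all of its work when a prediction is requested.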
46
Permutation
A post-training feature importance technique where the values of one feature at a time are randomly shuffled, and the resulting drop in model performance is measured to determine the importance of that feature.
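The sketch below illustrates the technique with accuracy as the performance measure. Everything named here is an assumption for illustration: `toy_model` stands in for an already-trained model and only looks at the first feature, so shuffling the second feature should change nothing.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(rows)

def permutation_importance(model, rows, labels, n_features, seed=0):
    """Importance of feature j = baseline accuracy minus accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)                      # break the link between feature j and the label
        shuffled_rows = [row[:j] + [value] + row[j + 1:]
                         for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled_rows, labels))
    return importances

def toy_model(row):
    # Hypothetical trained model: uses feature 0 only and ignores feature 1
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_model(row) for row in rows]
importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

A larger drop in accuracy means the model depended more heavily on that feature; a drop of zero suggests the feature was not used.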
47
Pandas library
A Python library used for data manipulation, cleaning, and analysis. It allows reading from and writing to various file formats such as CSV, Excel, SQL databases, and JSON.
48
.info() function
A function in Pandas that provides a summary of a DataFrame, including the number of non-null entries, data types of each column, and memory usage, helping to understand the structure of the dataset.
49
.head() function
A function in pandas that displays the first few rows of a DataFrame, helping you get a quick overview of the dataset. By default, it shows the first 5 rows.
50
.describe() function
A function in Pandas that generates summary statistics for numerical columns in a DataFrame, including count, mean, standard deviation, min, max, and quartiles, providing a quick statistical overview of the data.
51
.value_counts() function
A function in Pandas that returns the count of each unique value in a column, sorted in descending order, providing insights into the distribution of categorical data.
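As a small illustration of `.info()`, `.head()`, `.describe()`, and `.value_counts()` together, the sketch below builds a tiny DataFrame by hand. The column names and values are made up for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical dataset: one numerical column and one categorical column
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 35, 29],
    "city": ["Cairo", "Giza", "Cairo", "Luxor", "Cairo", "Giza"],
})

df.info()                                 # prints non-null counts, dtypes, memory usage
first_rows = df.head()                    # first 5 rows by default
stats = df.describe()                     # summary statistics for numerical columns only
city_counts = df["city"].value_counts()   # counts per unique value, largest first

print(first_rows)
print(stats)
print(city_counts)
```

Note that `.describe()` skips the `city` column here because, by default, it summarizes only numerical columns.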
52
.plot() Function
A function in Pandas used to create various types of plots (line, bar, histogram, etc.) directly from a DataFrame or Series, helping visualize the data.
53
Vectorization (appeared on last year's exam)
Representing text as numerical vectors to make it usable for machine learning models.
54
One hot encoding
A method of converting categorical data into numerical vectors, where each category is represented by a vector with a single active element set to 1 and all other elements set to 0.
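A minimal sketch of the encoding in plain Python. The function name is hypothetical, and ordering the categories alphabetically is an arbitrary choice made here so the result is repeatable.

```python
def one_hot_encode(values):
    """Map each categorical value to a vector with a single 1."""
    categories = sorted(set(values))                  # fixed, repeatable category order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1                      # the single active element
        vectors.append(vector)
    return categories, vectors

categories, vectors = one_hot_encode(["red", "green", "blue", "green"])
print(categories)   # ['blue', 'green', 'red']
print(vectors)      # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```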
55
Bag of words
A method for representing text where each word is treated as a separate feature, and the frequency of each word in the document is counted, ignoring grammar and word order.
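The following sketch shows one way to build such a representation by hand. It assumes lowercasing and whitespace tokenization, which the card does not specify; the function name and sample documents are hypothetical.

```python
from collections import Counter

def bag_of_words(documents):
    """Return the vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One feature per vocabulary word; word order in the document is ignored
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)      # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```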
56
Term Frequency (TF)
A measure of how often a word appears in a document, usually calculated as the number of times the word appears divided by the total number of words in the document.
57
Inverse document frequency (IDF)
A measure of how rare, and therefore how informative, a word is across the entire corpus. It is calculated by taking the logarithm of the total number of documents divided by the number of documents containing the word.
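Both TF and IDF can be computed directly from their definitions, as in the sketch below. It assumes the natural logarithm and whitespace tokenization, and it assumes the word occurs in at least one document (otherwise the division would fail). Function names and the sample corpus are hypothetical.

```python
import math

def term_frequency(word, tokens):
    """Occurrences of word in the document divided by the document's length."""
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, corpus):
    """log(total documents / documents containing word)."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

corpus = [doc.split() for doc in ["the cat sat", "the dog barked", "the cat purred"]]

tf_cat = term_frequency("cat", corpus[0])               # 1 of 3 words
idf_the = inverse_document_frequency("the", corpus)     # appears in every document
idf_dog = inverse_document_frequency("dog", corpus)     # appears in one document
print(tf_cat, idf_the, idf_dog)
```

A word such as "the" that occurs in every document gets an IDF of zero, while a rarer word such as "dog" gets a higher value.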
58
Decision node
A point in the decision tree where a condition is evaluated. It represents a feature in the dataset that is used to split the data.
59
Leaf node
The final output node that contains the predicted value or class label. Represents the conclusion of the decision process
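To make the two node types concrete, the sketch below writes a tiny tree by hand as nested dictionaries and walks it to a prediction. The features (`age`, `income`), the thresholds, and the labels are all invented for illustration; a real tree would be learned from data.

```python
# Decision nodes hold a feature and a threshold; leaf nodes hold only a label.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "no"},                       # leaf node
    "right": {                                     # decision node
        "feature": "income", "threshold": 50000,
        "left": {"label": "no"},
        "right": {"label": "yes"},
    },
}

def predict(node, sample):
    """Follow decision nodes until a leaf node is reached, then return its label."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

print(predict(tree, {"age": 25, "income": 90000}))   # no
print(predict(tree, {"age": 40, "income": 60000}))   # yes
```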
60
Decision boundary (hyperplane)
A decision plane that separates sets of objects having different class memberships.
61
Support vectors
The data points that are closest to the hyperplane. These points define the separating line better by calculating margins.
62
Margin
The distance between the closest data points (support vectors) and the decision boundary.
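As a geometric illustration only, the sketch below takes a fixed hyperplane w·x + b = 0, measures each point's distance to it, and reads off the margin and the support vectors. The hyperplane and points are hypothetical and chosen by hand; this does not train an SVM or search for the optimal boundary.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

w, b = [1.0, 1.0], -6.0                       # assumed boundary: x1 + x2 - 6 = 0
points = [(1, 1), (2, 2), (4, 4), (5, 5)]     # two points on each side of the boundary

distances = [distance_to_hyperplane(p, w, b) for p in points]
margin = min(distances)                       # distance of the closest points
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]

print(margin)
print(support_vectors)
```

Here the points (2, 2) and (4, 4) lie closest to the boundary, so they play the role of support vectors and their distance is the margin.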