General Knowledge (ML) Flashcards

1
Q

Active learning

A

A machine learning paradigm where a model is able to interactively query the user (or an oracle) to obtain labels for new data points. The key idea behind active learning is that the model can choose the most informative instances to query labels for, thereby maximizing the learning efficiency with fewer labeled examples.

Applied when obtaining labels is costly (for example, when it requires a medical expert's opinion). We start with a small number of labeled examples and a large number of unlabeled ones, and label only those examples that contribute the most to model quality. One approach: map the examples, look for clusters around the labeled examples, and compute the importance of each unlabeled example to the model. We then take the most important ones, ask the expert to label only these examples, and rebuild the model.
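A minimal sketch of the "pick the most informative examples" step. It swaps in uncertainty sampling (choose the examples the model is least sure about) rather than the cluster-based importance described above, and the probabilities are made-up model outputs, not real data.

```python
def most_uncertain(probabilities, k):
    """Return indices of the k unlabeled examples whose predicted
    positive-class probability is closest to 0.5 (least confident)."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# Hypothetical model outputs for five unlabeled examples.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.80]

# These are the examples we would send to the expert for labeling.
to_label = most_uncertain(pool_probs, k=2)
```

After the expert labels the selected examples, they move to the labeled set and the model is retrained.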

2
Q

Association Rule Learning

A

A machine learning technique used to discover interesting relationships or associations between variables in large datasets. It aims to identify patterns such as frequent itemsets or rules that describe the co-occurrence of items or events. Common algorithms for association rule learning include Apriori and FP-Growth, which are widely used in market basket analysis, recommendation systems, and customer behavior analysis.
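As an illustration, support and confidence (the two measures that algorithms such as Apriori rely on) can be computed directly on a toy basket dataset. The transactions below are invented for the example.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule "bread -> milk": 2 of 4 baskets contain both, 3 of 4 contain bread.
bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_conf = confidence({"bread"}, {"milk"}, transactions)
```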

3
Q

Binary classification (binomial)

A

A type of classification task in supervised learning, where the goal is to categorize data points into one of two possible classes or outcomes (e.g., positive/negative, spam/not spam, present/absent).

4
Q

Computational graph

A

A graphical representation of mathematical dependencies and operations or computations performed by a machine learning model. It consists of nodes representing mathematical operations and edges representing the flow of data between these operations. It’s essential for efficient calculation of gradients during backpropagation, the core algorithm for training neural networks.
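A toy sketch of the idea, assuming only addition and multiplication nodes. The `Node` class and `backward` function are hypothetical names for illustration, not any library's API; real frameworks build the same kind of graph with many more operations.

```python
class Node:
    """A value in the graph plus the local derivatives to its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def backward(output):
    """Apply the chain rule from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Depth-first walk: a node is appended only after all its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):          # output first, inputs last
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Node(2.0), Node(3.0)
z = x * y + x          # graph: multiply node feeding an add node
backward(z)            # dz/dx = y + 1, dz/dy = x
```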

5
Q

Convolution

A

Convolution is a mathematical operation that combines two functions (or two sets of information); the result measures how much one overlaps the other as it is shifted across it. In a CNN, a small filter is convolved with the input data to produce a feature map. This filter is also called a kernel or feature detector.

In convolutional neural networks (CNNs), convolutional layers use learned filters or kernels to extract features from input data. These filters slide over the input data, computing a dot product between the filter weights and the local regions of the input at each position. This process captures spatial hierarchies and patterns in the data, enabling CNNs to learn hierarchical representations and perform tasks such as image recognition, object detection, and natural language processing.

Convolution in CNNs plays a crucial role in feature extraction, where the learned filters act as feature detectors, detecting edges, textures, and other patterns in the input data. By stacking multiple convolutional layers and combining them with activation functions and pooling operations, CNNs can learn complex representations of the input data, making them powerful tools for various machine learning tasks.
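The sliding dot product can be sketched in plain Python. This assumes no padding and a stride of 1, and uses a hand-written vertical-edge kernel instead of a learned one; the tiny image is invented for the example.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and take the dot
    product at each position. As in most deep learning libraries, this is
    technically cross-correlation: the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Dark region on the left, bright region on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds where intensity increases from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d_valid(image, vertical_edge)
```

The feature map is non-zero only in the column where the dark-to-bright edge sits, which is what "feature detector" means in practice.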

6
Q

Deep learning

A

Training Neural Networks that have more than two non-output layers

7
Q

Deep models

A

Neural Networks that have 2 or more hidden layers

8
Q

Discriminative vs. Generative model

A

Generative models model the underlying probability distribution of the data. Generative models learn the joint probability distribution of both the input and output variables. They can generate new samples from the learned distribution.

Discriminative models directly model the decision boundary between different classes in the input space. They learn the conditional probability distribution of the output variables given the input variables. They try to predict the class of an example.
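A small sketch of how the two views relate, using counts over an invented weather dataset. The generative side estimates the joint p(x, y); the conditional p(y | x), which a discriminative model would target directly, is then recovered with Bayes' rule.

```python
from collections import Counter

# Invented (input, label) pairs.
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "play"), ("rainy", "stay")]

joint_counts = Counter(data)            # generative view: counts over (x, y)
labels = {label for _, label in data}
n = len(data)

def p_joint(x, y):
    """Estimated joint probability p(x, y)."""
    return joint_counts[(x, y)] / n

def p_y_given_x(x, y):
    """Conditional p(y | x) = p(x, y) / p(x), derived from the joint."""
    p_x = sum(p_joint(x, label) for label in labels)
    return p_joint(x, y) / p_x

joint_rainy_stay = p_joint("rainy", "stay")       # 2 of 5 examples
cond_stay_rainy = p_y_given_x("rainy", "stay")    # 2 of 3 rainy examples
```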

9
Q

Eager

A

An approach where computations or actions are performed immediately, without delay or postponement. Eager evaluation involves eagerly executing statements or expressions as soon as they are encountered in the program flow. This contrasts with lazy evaluation, where computations are deferred until their results are explicitly needed. Eager execution is commonly used in imperative programming languages and eager loading strategies in database systems, where the goal is to eagerly retrieve and process data to improve responsiveness and efficiency.
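In Python the contrast can be seen by comparing a list comprehension (eager) with a generator expression (lazy). The `calls` list is only there to record when the work actually happens.

```python
calls = []

def square(n):
    calls.append(n)      # record the moment the computation runs
    return n * n

eager = [square(n) for n in range(3)]    # list comprehension: runs immediately
calls_after_eager = len(calls)

lazy = (square(n) for n in range(3))     # generator: no call to square yet
calls_after_lazy_creation = len(calls)

first = next(lazy)                       # one value is computed on demand
calls_after_first = len(calls)
```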

10
Q

Empirical risk

A

Empirical risk, also known as empirical loss or training loss, measures the average error of a model’s predictions on the training dataset. Empirical risk is typically calculated as the average of a loss function applied to the predictions made by the model on the training data. Common loss functions include mean squared error (MSE) for regression tasks and cross-entropy loss for classification tasks. It quantifies how well the model fits the training data.
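As a sketch, empirical risk is just the mean of a per-example loss over the training set. The predictions and targets below are invented, and squared error is used as the loss, so the result is the MSE.

```python
def empirical_risk(loss, predictions, targets):
    """Average per-example loss over the training set."""
    total = sum(loss(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)

def squared_error(prediction, target):
    return (prediction - target) ** 2

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse = empirical_risk(squared_error, [2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
```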

11
Q

Ensemble models

A

Simple models may be too simple to fit the data well, while neural networks often need large amounts of labeled data. An alternative is to train many simple (weak) models and combine them into a high-accuracy meta-model.

Random Forest: A Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

Gradient Boosting Machines (GBM): GBM is a boosting technique where new models are added sequentially to correct the errors made by the existing ensemble. Each new model is fit to the residual errors (more generally, the negative gradient of the loss) of the models before it.

AdaBoost (Adaptive Boosting): AdaBoost is a boosting algorithm that combines multiple weak classifiers to create a strong classifier. It works by iteratively training weak classifiers on various distributions of the data, with each subsequent classifier giving more weight to the examples that were misclassified by previous classifiers.

XGBoost (Extreme Gradient Boosting): XGBoost is an optimized implementation of gradient boosting. It is known for its efficiency, scalability, and performance. It incorporates additional features such as regularization to prevent overfitting.

Stacking: Stacking, also known as Stacked Generalization, involves training a meta-model that combines the predictions of multiple base models. Instead of directly averaging or voting on predictions, stacking learns how to best combine the predictions of the base models to make the final prediction.

Bagging (Bootstrap Aggregating): Bagging involves training multiple instances of the same base learning algorithm on different subsets of the training data, typically using bootstrapping, and then averaging the predictions to reduce variance and improve generalization.

Voting Classifier/Regressor: This technique combines the predictions from multiple different models (e.g., decision trees, support vector machines, logistic regression) and outputs the majority (for classification) or average (for regression) prediction.
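A minimal sketch of hard (majority) voting, the simplest of the combination schemes above. The three models' predictions are invented; each is wrong on a different example, so the vote corrects all of them.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model[m][i] is model m's label for example i."""
    per_example = zip(*predictions_per_model)
    return [Counter(votes).most_common(1)[0][0] for votes in per_example]

true_labels = ["spam", "spam", "spam"]
model_a = ["spam", "ham", "spam"]     # wrong on example 1
model_b = ["spam", "spam", "ham"]     # wrong on example 2
model_c = ["ham", "spam", "spam"]     # wrong on example 0

combined = majority_vote([model_a, model_b, model_c])
```

This only helps when the models' errors are not strongly correlated, which is why bagging and random forests deliberately inject diversity.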

12
Q

Feature

A

Features are individual measurable properties or characteristics extracted from raw data that are relevant for solving a particular task in machine learning. These could be numeric, categorical, or binary variables that represent different aspects of the data. Features are used as input variables for training machine learning models.

Features can be informative or not. They can be:
- discriminative (help distinguish between different classes or categories in the dataset)
- domain-specific
- noise (do not contain any meaningful information and are purely random or noisy)
- redundant (highly correlated with other features in the dataset and do not provide additional information. Including redundant features can lead to overfitting and increased computational overhead. Redundant features often arise from feature transformations or combinations.)
- constant features (have the same value across all instances in the dataset. These features do not contribute any variability to the data and are thus irrelevant for modeling)

By the book, we would divide features into:
- Numerical vs. Categorical
- Binary vs. Multi-valued Categorical
- Derived (Transformed and Engineered)
- Temporal
- Spatial
- Textual
- Image
- Meta-Features
- Composite

13
Q

Functional Programming (FP)

A

A programming paradigm centered around the concept of functions as first-class citizens. In functional programming languages like Haskell, Lisp, and Clojure, functions are treated as values that can be passed as arguments to other functions, returned as results, or stored in data structures.

Key characteristics of functional programming include:

  • Immutability: Data is immutable, meaning that once defined, it cannot be changed. Functions operate on immutable data structures, ensuring referential transparency and avoiding side effects.
  • Higher-order Functions: Functions can take other functions as arguments or return functions as results. This allows for powerful abstractions and composition of functions.
  • Pure Functions: Functions are pure, meaning that they have no side effects and produce the same output for the same input. Pure functions are easier to reason about, test, and parallelize.

Functional programming promotes declarative and concise code, emphasizing the expression of computations as compositions of functions and transformations of data. It is particularly well-suited for parallel and distributed computing, concurrency, and building scalable and maintainable software systems.
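The three characteristics can be sketched in Python, which is not a purely functional language but supports the style. `compose`, `add_one`, and `double` are hypothetical helper names chosen for the example.

```python
from functools import reduce

def compose(*functions):
    """Higher-order function: compose(f, g)(x) returns f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), functions)

def add_one(x):       # pure: output depends only on the input
    return x + 1

def double(x):        # pure: no side effects
    return x * 2

add_then_double = compose(double, add_one)

original = (1, 2, 3)                                  # tuples are immutable
transformed = tuple(map(add_then_double, original))   # builds a new tuple
```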

14
Q

Greedy

A

A greedy algorithm is a strategy for solving problems where you make a series of choices, each time selecting the option that seems best at that moment. It focuses on short-term gains, hoping the final accumulation of these local decisions will lead to a good overall solution.

While greedy algorithms are simple and efficient, they may not always yield the optimal solution. They tend to get stuck in local optima, so in complex optimization problems where a global optimum is required they may fail. In other words: while they might find a decent solution quickly, there’s no guarantee it will be the best possible solution.

Greedy strategies appear in methods such as RFE (recursive feature elimination) and greedy search, and in classic algorithms for the Knapsack problem (optimal for the fractional variant, only a heuristic for the 0/1 variant), Minimum Spanning Tree (Kruskal, Prim), and Shortest Path (Dijkstra) problems.
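A small example of a greedy choice missing the optimum: making change with an invented coin system of 1, 3, and 4. The dynamic-programming function is included only as a reference for the true minimum.

```python
def greedy_change(coins, amount):
    """Repeatedly take the largest coin that still fits."""
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            used.append(coin)
    return used

def min_coins(coins, amount):
    """Reference answer: fewest coins, via dynamic programming."""
    best = [0] + [float("inf")] * amount
    for a in range(1, amount + 1):
        for c in coins:
            if c <= a:
                best[a] = min(best[a], best[a - c] + 1)
    return best[amount]

greedy = greedy_change([1, 3, 4], 6)     # takes 4 first, then two 1s
optimal_count = min_coins([1, 3, 4], 6)  # 3 + 3 needs only two coins
```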

15
Q

Image Captioning

A

Image captioning is the task of automatically generating a natural language description (a caption) of an image. It combines CNN with NLP. The model is trained on a large dataset of image-caption pairs. The goal is to learn how to generate captions that accurately and fluently describe the visual content of the images. Image Captioning is often done by Encoder-Decoder models. Here is how they work:

  1. The Encoder: Extracting Visual Information
    CNN Pre-trained on ImageNet: A powerful Convolutional Neural Network (often ResNet or VGG) pre-trained on a massive image classification task (like ImageNet) is utilized.
    Removing the Final Layer: The last classification layer of the CNN is removed, allowing it to function as a rich feature extractor.
    Image to Feature Vector: The CNN takes your input image and transforms it into a dense feature vector that encapsulates the visual essence of the image.
  2. The Decoder: Translating Features into Language
    LSTM as Word Generator: A Long Short-Term Memory (LSTM) network, a type of RNN, is tasked with generating the caption word by word.
    Start Token: A special "<start>" token signals the beginning of the caption.
    Step-by-Step Generation:
    - The LSTM takes the image's feature vector and the previously generated word as input.
    - It predicts probabilities for the next word in the vocabulary.
    - A word is sampled from this probability distribution (or the word with the highest probability is chosen).
    - The process repeats until an "<end>" token is generated or a max caption length is reached.
  3. Key Refinement: Attention
    An attention mechanism allows the LSTM decoder to selectively focus on specific regions or features of the image while generating each word. This dynamic attention helps the model generate more contextually relevant and descriptive captions.
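The word-by-word generation loop can be sketched as follows. `toy_decoder` and `TOY_TABLE` are hypothetical stand-ins for a trained LSTM: they ignore the image features and return fixed probabilities, so only the control flow (start token, pick the most probable word, stop at the end token or length limit) is illustrated.

```python
def greedy_decode(next_word_probs, image_features, max_len=10):
    """Generate a caption word by word, always taking the most probable word."""
    caption = ["<start>"]
    while len(caption) < max_len:
        probs = next_word_probs(image_features, caption[-1])
        word = max(probs, key=probs.get)
        if word == "<end>":
            break
        caption.append(word)
    return caption[1:]          # drop the start token

# Made-up next-word distributions keyed by the previous word.
TOY_TABLE = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.7, "cat": 0.3},
    "dog": {"runs": 0.6, "<end>": 0.4},
    "runs": {"<end>": 0.8, "a": 0.2},
}

def toy_decoder(image_features, previous_word):
    """Stand-in for the LSTM step; a real decoder would use image_features."""
    return TOY_TABLE[previous_word]

caption = greedy_decode(toy_decoder, image_features=[0.1, 0.2])
```

Sampling from `probs` instead of taking the maximum gives the other decoding option mentioned in the steps above.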
16
Q

Inference

A

Inferences are steps in reasoning, moving from specific premises to logical consequences and broader generalizations. They are probabilistic in nature: conclusions are likely, based on evidence, but not necessarily certain. (Deduction is similar but deals with certainties.)

In machine learning, inference is the process of applying a trained model to new, unseen data to make predictions and draw insights. It usually happens in the deployment stage.

Types of Inference:
Batch Inference: Making predictions on a group of data points all at once. This is often more efficient computationally.
Real-time Inference: Making predictions on individual data points immediately as they become available (e.g., fraud detection systems).

17
Q

Instance-based learning algorithm

A

Instance-based learning algorithms, also known as lazy learning algorithms, make predictions based on the similarity of input instances to instances in the training dataset. These algorithms do not explicitly build a model during training; instead, they store the training instances and use them for making predictions at runtime. Examples include k-nearest neighbors (KNN) and kernel density estimation (KDE).
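A minimal k-nearest-neighbors sketch shows the "no model, just stored instances" idea: all the work happens at prediction time. The points and labels are invented, and Euclidean distance is an assumption.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority label among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),   # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# "Training" is just storing these instances.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["a", "a", "a", "b", "b"]

prediction = knn_predict(points, labels, query=(4.8, 5.1), k=3)
```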

18
Q

Integral image concept

A

A technique used for fast computation of rectangular features in object detection tasks, particularly in the Viola-Jones face detection algorithm. The integral image stores, at each position, the sum of all pixel intensities above and to the left of it. Once it is built, the sum over any rectangular region can be obtained from four lookups, so features used for classification are computed without repeatedly summing the same pixels.
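A sketch of building the integral image (also called a summed-area table) and reading a rectangle sum from four corner lookups. The extra zero row and column are an implementation convenience assumed here, and the 3x3 image is invented.

```python
def integral_image(image):
    """ii[r][c] holds the sum of image rows < r and columns < c
    (one extra zero row and column avoids boundary checks)."""
    rows, cols = len(image), len(image[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ii[r + 1][c + 1] = (image[r][c] + ii[r][c + 1]
                                + ii[r + 1][c] - ii[r][c])
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of image[top..bottom][left..right] (inclusive), four lookups."""
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
lower_right_block = rect_sum(ii, top=1, left=1, bottom=2, right=2)  # 5+6+8+9
```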

19
Q

Meta Learning

A

Also known as “learning to learn,” refers to a subfield of machine learning focused on understanding and developing algorithms capable of learning new tasks or adapting to new environments rapidly and efficiently. Unlike traditional machine learning approaches that focus on learning from a fixed dataset to solve a specific task, meta learning aims to enable models to learn from a variety of tasks or experiences and generalize that knowledge to new tasks or domains. Meta learning algorithms often involve training a meta-learner on a distribution of tasks or datasets, allowing it to extract common patterns or principles that can be applied to unseen tasks. Meta learning has applications in few-shot learning, transfer learning, and reinforcement learning, among others, and it holds promise for enabling AI systems to learn more autonomously and adaptively in diverse and dynamic environments.

Meta-learning systems are trained on a large and diverse collection of tasks, each often with only a small amount of data. The meta-learner extracts patterns in how to approach different types of problems and learns a strategy for updating its own parameters rapidly when given a new task. Three common families of approaches:
- Metric-based: the idea is to learn a similarity metric between data points. A new task is then solved by comparing new data points to labeled examples using this learned metric (similar to k-nearest neighbors).
- Model-based: the meta-learner is a recurrent neural network (RNN) or similar architecture with internal “memory.” It is trained to update its own parameters quickly to adapt to a new task, utilizing its past experiences.
- Optimization-based: the focus is on learning a good initialization point for the model’s parameters and/or designing an optimization algorithm that quickly converges to a solution on new tasks.

20
Q

Multi-Label Classification

A

In most classification tasks, each data point belongs to one and only one class (e.g., an image is either a “cat” or a “dog”, but not both). In multi-label classification, a single data point can be associated with multiple labels simultaneously. Traditional classifiers predict a single class; multi-label classifiers predict a set of relevant labels. Multi-label problems are more complex because you need to model potential correlations between labels: a cat and a dog appearing in the same image is plausible, while certain disease combinations might be rare. Single-label classifiers focus on the boundaries between classes; multi-label classifiers must also consider relationships between possible labels (some labels might co-occur frequently, others might be mutually exclusive).

Approaches
1. Problem Transformation: The core idea behind problem transformation is to convert a multi-label classification problem into a format that traditional single-label classification algorithms can understand.
* Binary Relevance: Train a separate binary classifier for each possible label. Treats each possible label as a separate yes/no classification problem. For a dataset with ‘N’ possible labels, you would train ‘N’ independent binary classifiers. To classify a new data point, you run it through each of the binary classifiers. Any classifier that outputs “yes” (or a probability above a threshold) gives you a label. It is easy to implement using existing single-label classifiers. Training the binary classifiers can be done in parallel. The biggest drawback: it treats each label in complete isolation, missing out on potentially helpful relationships between labels. Works decently when labels are mostly independent, or when computational efficiency is paramount.
* Label Powerset: Create a new class for every possible label combination. Explicitly models every possible combination of labels as a unique class. You train a single (now traditional) multi-class classifier on this transformed problem. It can capture potential correlations between labels. The number of new classes grows exponentially with the number of labels. This quickly becomes computationally demanding and can lead to data sparsity (few examples for each combination). Useful for problems with a small number of possible labels and where correlations between those labels are crucial.

2. Specialized Algorithms:
    • Classifier Chains: Train classifiers sequentially, where each classifier’s predictions are used as features for the next one in the chain. Involves training a series of binary classifiers, but unlike binary relevance, they are linked together. The first classifier is trained just on the input features. The second classifier is trained on the input features AND the predictions made by the first classifier. This continues for each subsequent classifier, building a chain where previous predictions become inputs. The idea is to gradually learn dependencies between labels. Labels that frequently occur together should influence each other within the chain. The core advantage over binary relevance is modeling some degree of relationship between labels. The order of labels in the chain can be determined strategically for potential performance gains. Mistakes made early in the chain can cascade down and negatively affect the predictions of later classifiers. Performance can be affected by the chosen label order in the chain.
    • ML-KNN (Multi-Label K-Nearest Neighbors): Find the ‘K’ nearest neighbors of a new data point. Examine the label sets associated with those neighbors. Determine relevant labels for the new point based on statistics about the neighbors’ labels (e.g., frequency, average probabilities). Easy to grasp and explain. Doesn’t make strong assumptions about the underlying data distribution. Finding nearest neighbors, especially in large datasets, can be slow. Outliers or incorrect labels in the training data can throw off its predictions.
    • Deep Learning Methods: Adapted neural network architectures for handling multiple outputs.
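The two problem-transformation approaches differ mainly in how they reshape the targets. A sketch on an invented four-example dataset, showing only the target transformation and not the classifiers themselves:

```python
# Each training example carries a set of labels.
label_sets = [
    {"cat"},
    {"cat", "dog"},
    {"dog"},
    {"cat", "dog"},
]
all_labels = sorted(set().union(*label_sets))

# Binary relevance: one independent 0/1 target vector per label,
# each of which would get its own binary classifier.
binary_targets = {
    label: [int(label in s) for s in label_sets] for label in all_labels
}

# Label powerset: every distinct label combination becomes one class
# for a single multi-class classifier.
powerset_targets = [tuple(sorted(s)) for s in label_sets]
powerset_classes = sorted(set(powerset_targets))
```

With N labels, binary relevance always needs N targets, while label powerset can need up to 2^N classes, which is the exponential growth mentioned above.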
21
Q

Multiclass classification (multinomial)

A

Multiclass classification is a type of supervised learning task where the goal is to classify input data into one of three or more classes or categories. It involves predicting a single output variable with multiple possible discrete values. Models for this task typically produce a probability for each class and select the most probable class as the final prediction. Multinomial logistic regression and softmax activation in neural networks are common approaches for handling multiclass classification tasks.
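A sketch of the softmax step followed by picking the most probable class. The class names and raw scores (logits) are invented for the example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)                        # subtract max for stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "bird"]
probs = softmax([2.0, 1.0, 0.1])               # hypothetical model scores
predicted = classes[probs.index(max(probs))]   # most probable class
```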

22
Q

Multitask learning

A

Multitask learning (MTL) is a machine learning paradigm where a model is trained to perform multiple tasks simultaneously, leveraging shared information across tasks to improve generalization performance. In MTL, the model is trained on a joint objective that combines the loss functions of all tasks, encouraging the model to learn representations that are beneficial for all tasks. By sharing knowledge across related tasks, multitask learning can lead to better generalization, especially when individual tasks have limited amounts of training data or when tasks are related in some way (e.g., semantic similarity or shared underlying structure). Multitask learning has applications in various domains, including natural language processing, computer vision, and healthcare, where different tasks may benefit from leveraging common features or learning from related data sources.

23
Q

Nodes

A

“Node” is another name for “neuron.” Because the name “neuron” was inspired by biological neurons, and artificial ones don’t really work like them, “node” is sometimes preferred to highlight that difference.

“A neuron is the most basic processing unit within an artificial neural network. The concept of artificial neurons in neural networks is loosely inspired by biological neurons in the brain. Biological neurons receive signals (inputs) through connections called dendrites, process them, and send an output signal through the axon if a certain threshold is met.

Neural networks learn by adjusting the weights and biases during training. The goal is to find the optimal values that produce the desired output given a specific input. Artificial neural networks are organized into layers: an input layer, one or more hidden layers, and an output layer. Neurons in one layer are connected to neurons in the next, creating a complex network of calculations.

In a neural network, a neuron is a mathematical function that performs the following:
1) Inputs: A neuron receives multiple input values. These inputs could come from raw data (e.g., pixel values of an image) or be the outputs of neurons from a previous layer in the neural network.
2) Weights: Each input is multiplied by a corresponding weight. Weights are like knobs that determine how much influence each input has on the neuron’s output.
3) Summation: The weighted inputs are summed together.
4) Bias: A bias term is added to the sum. The bias acts as an adjustable offset that shifts how easily the neuron activates, independent of its inputs.
5) Activation Function: The result of the summation (and bias) is passed through a non-linear activation function. This function introduces non-linearity into the model, which is essential for neural networks to learn complex patterns. Common activation functions include:
- Sigmoid
- Tanh
- ReLU (Rectified Linear Unit)
6) Output: The output of the activation function is the final output of the neuron. This output can then be sent to neurons in the next layer of the neural network.

Simple Analogy
Imagine a neuron like a decision-maker. Consider the decision of whether to wear a coat outside:

1) Inputs: Temperature, wind speed, likelihood of rain.
2) Weights: How heavily you weigh each factor (you might care more about temperature than wind, etc.)
3) Bias: Your general predisposition towards wearing a coat (some people are more likely to get cold).
4) Activation Function: Your mental model deciding if the combined factors cross a threshold for putting on a coat.
5) Output: The decision – coat or no coat.”
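The steps above (weights, summation, bias, activation, output) can be sketched with the coat analogy. All numbers are made up for illustration, and a sigmoid is assumed as the activation function.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum plus bias, passed through a sigmoid activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Inputs: temperature (deg C), wind speed (km/h), probability of rain.
weights = [-0.3, 0.05, 2.0]   # colder, windier, rainier -> more "coat"
bias = 0.5                    # general predisposition towards a coat

cold_day = neuron([5.0, 30.0, 0.8], weights, bias)
warm_day = neuron([25.0, 5.0, 0.0], weights, bias)

wear_coat_cold = cold_day > 0.5   # threshold the activation for a decision
wear_coat_warm = warm_day > 0.5
```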

24
Q

Non-parametric model

A

A type of statistical model that does not assume a fixed functional form or distribution for the underlying data. Instead of estimating a fixed, finite set of parameters, nonparametric models let their complexity grow with the amount of observed data and estimate the structure directly from it.

Examples include k-nearest neighbors (KNN), decision trees, and kernel density estimation (KDE). Nonparametric models are flexible and can capture complex relationships in the data without making strong assumptions.

25
Q

Non-sequential models

A

Non-sequential models refer to machine learning models that do not inherently rely on sequential data or temporal dependencies. Unlike sequential models such as recurrent neural networks (RNNs) or transformers, which are designed to process sequential data with explicit temporal order, non-sequential models can handle input data without assuming any specific order or sequence. Examples of non-sequential models include feedforward neural networks (e.g., multilayer perceptrons), convolutional neural networks (CNNs), and graph neural networks (GNNs). Non-sequential models are commonly used in tasks such as image classification, object detection, and graph analytics, where the input data does not have an inherent temporal or sequential structure.

26
Q

Numerical overflow

A

Occurs when a numerical computation results in a value that exceeds the maximum representable value for a numeric data type. This can lead to inaccuracies or errors in computations and is particularly common in floating-point arithmetic. Overflow can occur due to large intermediate results or excessively large inputs.

This is one reason floating-point formats store numbers in exponential form (a significand and an exponent), which greatly extends the range of representable values.
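A sketch of overflow caused by a large intermediate result, and a common workaround. Computing log(sum(exp(v))) naively overflows for large inputs even though the final answer is small; subtracting the maximum first (the log-sum-exp trick) avoids it. The scores are invented.

```python
import math

def naive_log_sum_exp(values):
    return math.log(sum(math.exp(v) for v in values))

def stable_log_sum_exp(values):
    """Same quantity, computed after subtracting the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

scores = [1000.0, 1000.0]

try:
    naive_log_sum_exp(scores)
    naive_overflowed = False
except OverflowError:
    # math.exp(1000.0) exceeds the largest representable float.
    naive_overflowed = True

stable = stable_log_sum_exp(scores)   # equals 1000 + log(2)
```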

27
Q

Objective function

A

Often used interchangeably with loss function or cost function, the objective function measures how well a model’s predictions match the true values in the training data. It quantifies the discrepancy between predicted and actual values and is used to optimize model parameters during training. The goal is usually to minimize its value (or to maximize it, for objectives such as likelihood) to improve the model’s performance.

28
Q

Parallelize

A

Parallelization in machine learning means splitting computationally intensive tasks across multiple processors or cores to speed up model training or processing. This can involve parallelizing data processing (e.g., loading and transforming data simultaneously on multiple cores), parallelizing the calculations within model algorithms (e.g., matrix operations in neural networks spread across multiple GPUs), or parallelizing multiple model training experiments (e.g., hyperparameter tuning with different configurations running concurrently). The primary goal is to reduce the time it takes to learn from data, allowing for faster experimentation, handling larger datasets, or exploring more complex models.
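A sketch of the third case, running several hyperparameter configurations concurrently. `evaluate` is a hypothetical stand-in for training and scoring a model, with a toy score that peaks at a learning rate of 0.1 and a depth of 3. A thread pool is used for simplicity; for CPU-bound pure-Python work, a process pool is usually the better fit because of CPython's global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    """Hypothetical stand-in for training and scoring one configuration."""
    learning_rate, depth = config
    return -(learning_rate - 0.1) ** 2 - (depth - 3) ** 2   # toy score

# Grid of configurations to try.
configs = [(lr, d) for lr in (0.01, 0.1, 1.0) for d in (2, 3, 4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, configs))   # results keep input order

best_config = configs[scores.index(max(scores))]
```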

29
Q

Perceptron

A

A perceptron is a specific type of node in a neural network, specifically in the context of a single-layer neural network. It takes multiple input values, each multiplied by a weight, and produces a single output based on a threshold activation function. The perceptron was one of the earliest models of artificial neurons proposed by Frank Rosenblatt in 1957. Mathematically, the output of a perceptron can be represented as the sign of the dot product of the input vector and the weight vector plus the bias term. Perceptrons can only learn linearly separable functions.

In summary, while a perceptron is a type of node, it has specific characteristics and limitations compared to more complex nodes used in modern neural networks, such as those with non-linear activation functions and multiple layers (multilayer perceptrons or deep neural networks).

A perceptron is a single computational unit within a neural network. It’s inspired by the idea of a biological neuron but heavily simplified.

a) Perceptron takes multiple numerical inputs (x1, x2, … xn).
b) Calculates a weighted sum of its inputs (each input multiplied by a corresponding weight).
c) Adds a bias term to shift the result.
d) Applies an activation function to this sum (usually step function), producing an output (often a binary 0 or 1).

Perceptrons are classic building blocks of neural networks; however, modern networks often use nodes with more complex activation functions beyond a simple step.

Think of it like squares and rectangles: all perceptrons are nodes (like all squares are rectangles), but not all nodes are perceptrons (rectangles can have different proportions). A node is a general term for a computational unit in a graph-like structure. A perceptron is a specific node type designed for simple binary classification tasks.

A node is a generic element within a neural network or other computational graph. It represents a unit of computation. Nodes can perform various functions: act as simple input or output placeholders, hold parameters (weights and biases), apply activation functions, or implement more complex mathematical operations.

A perceptron is a particular kind of node with a well-defined structure and function: it takes multiple inputs, calculates a weighted sum plus bias, and (typically) applies a step activation function. Perceptrons are designed to act as extremely simple binary classifiers.
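A sketch of a perceptron learning the logical AND function, which is linearly separable. The training loop uses the classic perceptron update rule (nudge weights by the prediction error), which the card itself does not spell out; the learning rate and epoch count are arbitrary choices.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum plus bias."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, epochs=10, lr=1.0):
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)   # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]

weights, bias = train_perceptron(and_inputs, and_labels)
outputs = [predict(weights, bias, x) for x in and_inputs]
```

The same loop never settles on XOR labels, since no single line separates them.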

30
Q

Pipeline (definition)

A

A structured sequence of technical steps that transform raw data into actionable insights or predictions.

31
Q

Pipeline

A
  1. Problem Definition and Data Collection
    a) Problem Definition:
    • Clearly define the business or research problem you want to solve.
    • Determine whether machine learning is the appropriate solution.
    • Identify if it’s a supervised, unsupervised, or reinforcement learning problem.
      b) Data Collection:
    • Gather relevant data from various sources (databases, APIs, web scraping, etc.).
    • Determine if you have sufficient data to proceed.
    • If needed, design data collection strategies (surveys, experiments).
  2. Data Preprocessing & Exploration
    a) Data Cleaning:
    • Handle missing values (remove, impute).
    • Address outliers and inconsistent data.
    • Format data appropriately for modeling (numeric values, etc.).
    • Resampling
    • Reshaping
      b) Exploratory Data Analysis (EDA):
    • Use statistical summaries and visualizations to understand the data.
    • Identify patterns, trends, correlations, and potential issues.
      c) Balancing imbalanced dataset
      d) Feature Engineering:
    • Create new features from existing data that help models learn better.
    • Select the most important features, or apply dimensionality reduction.
    • Transform features into the correct formats (scaling, encoding).
  3. Model Development
    a) Model Selection:
    • Choose an appropriate algorithm based on the task (classification, regression, clustering, etc.) and the data’s characteristics.
    • Start with simple models (e.g., linear regression, decision trees) and consider more complex ones (e.g., neural networks) if needed.
      b) Feature Selection
    • Select features appropriate for the model.
      c) Model Training:
    • Split your data into training and testing sets.
    • Train the model on the training set to learn patterns.
    • Use techniques like cross-validation and regularisation to prevent overfitting.
      d) Hyperparameter Tuning:
    • Experiment with different settings (learning rate, number of layers in a neural network, etc.) to optimize performance.
  4. Model Evaluation
    a) Evaluation Metrics:
    • Choose appropriate metrics based on the problem (accuracy, precision, recall, F1-score, etc.).
      b) Model Testing:
    • Evaluate performance on the unseen testing set to get a realistic assessment.
      c) Error Analysis:
    • Examine patterns in errors to identify areas for improvement. Investigate whether it’s an issue of bias, variance, or the model’s limitations.
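
The preprocessing stage of the pipeline above can be sketched as a chain of small functions applied in order. This is only an illustration in plain Python: the helper names (impute_mean, min_max_scale, run_pipeline) and the data are assumptions, not part of any library.

```python
def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values to the range [0, 1] (assumes at least two distinct values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def run_pipeline(values, steps):
    """Apply each preprocessing step in order, feeding the output of one into the next."""
    for step in steps:
        values = step(values)
    return values

raw = [2.0, None, 4.0, 6.0]  # hypothetical feature column with one missing value
clean = run_pipeline(raw, [impute_mean, min_max_scale])
print(clean)  # [0.0, 0.5, 0.5, 1.0]
```

In a real project, steps like these are fitted on the training split only and then applied unchanged to the test split, so that no information leaks from the test data.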
32
Q

Dimensionality Reduction Techniques

A

Principal Component Analysis (PCA):
Finds new dimensions (principal components) that maximize the variance captured from the original data.
Linear technique commonly used for visualization and preprocessing.

Linear Discriminant Analysis (LDA):
Similar to PCA, but focuses on finding dimensions that maximize the separation between different classes (used for supervised learning problems).

t-Distributed Stochastic Neighbor Embedding (t-SNE):
Great for visualization. Preserves the local structure of data, creating clusters in a lower-dimensional space.
Nonlinear technique, not ideal if you later need the reduced features as input to a model.

Autoencoders:
Neural networks trained to reconstruct their own input. A key use case is dimensionality reduction: the compressed code in the latent space acts as a lower-dimensional representation.

Feature Selection:
Identifying and selecting the most relevant subset of existing features. Methods include filter-based, wrapper-based, and embedded feature selection.
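
To make the PCA idea concrete, here is a rough sketch that finds the first principal component of 2-D points with power iteration on the covariance matrix, then projects the points onto it. It is a toy written for this card (the function name and data are made up); real work would normally use a numerical library.

```python
def first_principal_component(points, iterations=100):
    """Estimate the direction of maximum variance of 2-D points by power iteration."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    # Arbitrary start; assumed not to be orthogonal to the leading direction.
    vx, vy = 1.0, 0.5
    for _ in range(iterations):
        vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = (vx * vx + vy * vy) ** 0.5
        vx, vy = vx / norm, vy / norm
    return (vx, vy), centered

# Hypothetical data lying exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, centered = first_principal_component(points)

# Reduce each 2-D point to a single coordinate along the principal direction.
projected = [x * direction[0] + y * direction[1] for x, y in centered]
```

For this data the direction comes out along the diagonal (both components about 0.707), so one coordinate is enough to describe every point.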

33
Q

Pure subset

A

In decision tree analysis, a pure subset refers to a group of data samples (for example, those reaching a leaf node in the tree) where all the samples belong to the same class, so the predicted outcome has no contradicting cases. If a new data sample falls into a pure subset, the decision tree predicts that subset’s single class. Purity is measured on the training data, so it does not guarantee that the prediction for a new sample is correct. Pure subsets are desirable because they indicate clear patterns in the training data, but growing a tree until every leaf is pure can lead to overfitting.
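
Purity is usually quantified with an impurity measure. The sketch below uses Gini impurity, one common choice (entropy is another), which is 0 for a pure subset. The labels are made-up examples.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure subset, larger when classes are mixed."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

pure = gini_impurity(["spam", "spam", "spam"])
mixed = gini_impurity(["spam", "ham", "spam", "ham"])
print(pure, mixed)  # 0.0 0.5
```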

34
Q

Regression

A

Regression is a problem of predicting a real-valued label (often called a target) given an unlabeled example. Estimating house price valuation based on house features, such as area, the number of bedrooms, location and so on is a famous example of regression.
The regression problem is solved by a regression learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and output a target.
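
As an illustration of a regression learning algorithm, the sketch below fits a straight line to one feature with ordinary least squares and then predicts a target for an unlabeled example. The house areas and prices are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled examples: house area (m^2) and price (in thousands).
areas = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]

slope, intercept = fit_line(areas, prices)   # the "model" is these two numbers
predicted = slope * 70.0 + intercept         # target for an unlabeled 70 m^2 house
```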

35
Q

Reinforcement Learning

A

A type of machine learning where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards over time. It involves learning a policy or strategy for selecting actions based on observed states and received rewards. Reinforcement learning is commonly used in scenarios with sequential decision-making and sparse feedback.

36
Q

Shallow models

A

A shallow learning algorithm learns the parameters of the model directly from the features of the training examples (without intermediate layers of operations on the data). Most supervised learning algorithms are shallow. Shallow models are machine learning models with a limited number of layers or limited computational depth. These models typically consist of a single layer (e.g., logistic regression, perceptron) or a small number of layers (e.g., shallow neural networks). Shallow models are computationally efficient and often suitable for simple tasks but may lack the capacity to learn complex patterns present in data.

37
Q

Supervised Learning

A

A machine learning paradigm where the model is trained on labeled data, consisting of input-output pairs. The goal is to learn a mapping from input features to output labels or target variables by minimizing a predefined loss or error function. Supervised learning tasks include regression (predicting continuous values) and classification (predicting categorical labels).

38
Q

Tensor

A

A tensor is an array: that is, a data structure that stores a collection of numbers that are accessible individually using an index, and that can be indexed with multiple indices.
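
A quick way to picture this is a nested Python list, where each extra level of nesting adds one index. This is only a sketch: the shape helper is hypothetical and assumes a regular, non-empty nesting.

```python
def shape(tensor):
    """Return the size along each axis of a regular, non-empty nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

# A tensor with three indices: 2 blocks, each with 2 rows and 3 columns.
t = [[[1, 2, 3], [4, 5, 6]],
     [[7, 8, 9], [10, 11, 12]]]

element = t[1][0][2]  # block 1, row 0, column 2
print(shape(t), element)  # (2, 2, 3) 9
```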

39
Q

Time series concepts: (Random events (irregular), Trend, Seasonality, Cycle)

A
  • Random Events (Irregular): Unpredictable fluctuations or noise in the time series data that do not follow any discernible pattern.
  • Trend: The long-term movement or directionality of the time series, indicating overall growth, decline, or stability.
  • Seasonality: Periodic patterns or fluctuations in the time series that occur at fixed intervals, such as daily, weekly, monthly, or yearly cycles.
  • Cycle: Longer-term fluctuations or patterns in the time series that repeat irregularly, often over multiple years, and may be influenced by economic or environmental factors.
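
To show how trend and seasonality can be separated, the sketch below builds a synthetic series (a rising trend plus a repeating pattern of period 3) and smooths it with a centered moving average. The series and function name are made up; the window is deliberately set equal to the seasonal period so the seasonal effect averages out.

```python
def centered_moving_average(series, window):
    """Average each point with its neighbours; window is assumed to be odd."""
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

# Synthetic data: trend 0, 1, 2, ... plus a seasonal pattern that sums to zero.
seasonal_pattern = [1, 0, -1]
series = [i + seasonal_pattern[i % 3] for i in range(9)]

trend = centered_moving_average(series, window=3)

# What is left after removing the trend is the seasonal component
# (trend[i] is centered on series[i + 1]).
detrended = [series[i + 1] - trend[i] for i in range(len(trend))]
print(trend)      # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(detrended)  # [0.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0]
```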
40
Q

Transfer Learning

A

Transfer learning means reusing a pre-trained model on a new problem. The model exploits the knowledge gained from a previous task to improve generalization on another.

  1. Pre-trained Model: A model is trained on a large dataset for a specific task. This model can be very complex, requiring a significant amount of data and computational resources.
  2. Feature Extraction: The pre-trained model learns low-level features that are generally useful for many tasks. These features might capture basic shapes, edges, or color patterns in the case of image recognition models.
  3. Transferring Knowledge: For a new task, the pre-trained model serves as a starting point. Often, the initial layers responsible for extracting these low-level features are kept frozen (their weights don’t change).
  4. Fine-tuning: New layers are added on top of the pre-trained model. These new layers are specifically designed for the new task and are trained on a smaller dataset relevant to that task.
41
Q

Types of tasks in ML

A

Some tasks can blend techniques (e.g., using classification results for a recommender system). The list is not exhaustive.

Supervised Learning
Classification: Predicting a discrete class label for an input.
Techniques: Logistic Regression, Decision Trees, Support Vector Machines (SVMs), Neural Networks, Random Forests.
Regression: Predicting a continuous numerical value.
Techniques: Linear Regression, Polynomial Regression, Lasso/Ridge Regression, Support Vector Regression (SVR), Neural Networks.

Unsupervised Learning
Clustering: Grouping data points into clusters based on similarity.
Techniques: K-means, Hierarchical Clustering, Density-Based Clustering (DBSCAN)
Dimensionality Reduction: Finding lower-dimensional representations of data that capture essential information.
Techniques: Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Autoencoders
Semi-Supervised Learning: Utilizes both labeled and unlabeled data for training, especially when labeled data is limited.
Techniques: Self-training, Generative Adversarial Networks (GANs)

Reinforcement Learning: Learning optimal action policies in an environment through trial-and-error interactions.
Techniques: Q-Learning, Deep Q-Networks (DQN), Policy Gradient methods

Specialized Tasks
Natural Language Processing (NLP):
Text classification: Sentiment analysis, spam detection, topic modeling.
Machine translation: Translating between languages.
Text Summarization: Generating concise summaries of documents.
Question Answering: Building systems that can answer questions based on a given text.

Computer Vision:
Image Classification: Assigning labels to images (cat, dog, etc.).
Object Detection: Localizing and classifying objects within images.
Image Segmentation: Dividing images into regions of interest.

Time Series Analysis and Forecasting:
Forecasting future values: Predicting stock prices, weather patterns, etc.
Anomaly Detection: Identifying unusual patterns in time series data.

Recommender Systems:
Collaborative Filtering: Recommending items based on the preferences of similar users or items.
Content-Based Filtering: Recommending items with features similar to those a user has liked in the past.

Generative Models:
Image Generation: Creating realistic images (GANs, Variational Autoencoders).
Text Generation: Generating human-like text.

42
Q

Univariate

A

In ML, “univariate” refers to data or analysis that involves a single variable or feature. A dataset is univariate if it contains only one feature that you’re interested in. Example: A list of house prices in a particular neighborhood. Some machine learning models are inherently univariate, like simple linear regression where you’re predicting one target based on a single other feature.

Univariate is the opposite of multivariate, which refers to data with multiple features or variables. Most real-world ML problems are multivariate. Example: predicting house prices using features like square footage, number of bedrooms, location, etc.

43
Q

Unsupervised Learning

A

A machine learning paradigm where the model is trained on unlabeled data without explicit input-output pairs. The goal is to discover underlying patterns, structures, or relationships in the data. Unsupervised learning tasks include clustering (grouping similar data points), dimensionality reduction (reducing the number of input features), and generative modeling (learning the data distribution).

44
Q

Vanilla

A

The term “vanilla” in machine learning often refers to the simplest or most basic version of an algorithm, without any modifications or enhancements.

45
Q

Workflow

A

A structured sequence of broad-ranging steps that accomplish the task.

  • Data collection and preparation
  • Model development (experimentation, selection, training)
  • Evaluation and validation
  • Deployment into a production environment
  • Monitoring model performance and retraining
46
Q

Zero-shot learning

A

Zero-shot learning (ZSL) is a machine learning paradigm where a model is asked to classify or make predictions about classes it has never seen any labeled examples of during training. This is possible by leveraging prior knowledge about the world, often in the form of word embeddings or other semantic representations. For example, a ZSL model might be trained on images of different animals. Even if it’s never seen a “zebra,” when presented with a zebra image and the description “striped horse-like animal,” it might correctly classify it because it can combine its knowledge of existing classes and the provided description.

ZSL fundamentally relies on the idea that even unseen classes can be described in terms of attributes or features shared with known classes. For example, both a “horse” and a “zebra” have attributes like “four legs,” “hooves,” “mane,” etc. A ZSL model learns to map both images and class descriptions (even descriptions of unseen classes) into this shared attribute space.

Training: The model learns a mapping between images and their corresponding attribute descriptions for a set of known classes. It also learns a similar mapping between class names and their attribute descriptions.
Inference: When presented with a new image and asked to classify it, the model predicts the image’s attributes. It then compares these predicted attributes to the attribute descriptions of all classes (including those not seen in training). The class whose description best matches the predicted attributes is chosen as the label.
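
The inference step above can be sketched as a nearest-match search in attribute space. Everything here is illustrative: the attribute vectors are hand-written, and the "predicted attributes" stand in for the output of a trained image model that is not shown.

```python
def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Hypothetical attribute order: [four_legs, hooves, stripes, mane]
class_attributes = {
    "horse": [1, 1, 0, 1],
    "tiger": [1, 0, 1, 0],
    "zebra": [1, 1, 1, 1],  # no training images; known only through its description
}

def classify(predicted_attributes):
    """Pick the class whose attribute description best matches the prediction."""
    return max(
        class_attributes,
        key=lambda name: cosine_similarity(predicted_attributes, class_attributes[name]),
    )

# Pretend the image model predicted these attribute scores for a zebra photo.
label = classify([0.9, 0.8, 0.9, 0.7])
print(label)  # zebra
```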

47
Q

Batches

A

Batches refer to subsets of a dataset that are processed together during training in machine learning models. Splitting data into batches allows for efficient training by reducing memory usage and enabling parallel processing.
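
A minimal sketch of splitting a dataset into batches, assuming the data fits in a Python list (the function name is made up). Note that the last batch may be smaller than the others.

```python
def iterate_batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

batches = list(iterate_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```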

48
Q

Classification

A

The task of assigning data points to predetermined categories based on learned patterns. Models are trained on data with known labels to predict the labels of new, unseen data.

Classification is a problem of automatically assigning a label to an unlabeled example. Spam detection is a famous example of classification.
In machine learning, the classification problem is solved by a classification learning algorithm that takes a collection of labeled examples as inputs and produces a model that can take an unlabeled example as input and either directly output a label or output a number that can be used by the analyst to deduce the label. An example of such a number is a probability.
In a classification problem, a label is a member of a finite set of classes. If the size of the set of classes is two (“sick”/“healthy”, “spam”/“not_spam”), we talk about binary classification (also called binomial in some sources). Multiclass classification (also called multinomial) is a classification problem with three or more classes.

49
Q

Clustering Algorithms

A

Unsupervised learning techniques used to group similar data points into clusters in machine learning and artificial intelligence. These algorithms partition data based on similarity or distance metrics. Examples include K-means, hierarchical clustering, DBSCAN, and Gaussian mixture models. Clustering algorithms find applications in exploratory data analysis, customer segmentation, anomaly detection, and dimensionality reduction.

50
Q

Epochs

A

An epoch represents one complete pass of the entire training dataset through the learning algorithm. It’s like showing your student the entire textbook once.

Why Epochs Matter:
Gradual Learning: Multiple epochs are typically required for a model to learn effectively. Each epoch builds upon the knowledge gained from the previous one.
Balancing Act: Too few epochs might result in underfitting (the model doesn’t learn enough), while too many epochs could lead to overfitting (the model memorizes the training data but doesn’t generalize well to unseen examples).
Key points to remember:

One epoch is one complete pass through the entire training dataset. Each epoch involves updating the model’s parameters based on its performance on individual training examples. The number of epochs is a hyperparameter that needs to be tuned for optimal performance.
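
The toy training loop below shows where epochs fit: the outer loop counts epochs, and the inner loop is one full pass over the data. It fits y = w * x with one gradient update per example; the data, learning rate, and epoch count are arbitrary choices for illustration.

```python
def train(xs, ys, epochs, learning_rate=0.05):
    """Fit y = w * x by gradient descent, one parameter update per training example."""
    w = 0.0
    losses = []
    for _ in range(epochs):
        for x, y in zip(xs, ys):            # one full pass over the data = one epoch
            gradient = 2 * (w * x - y) * x  # derivative of (w*x - y)^2 with respect to w
            w -= learning_rate * gradient
        # Mean squared error on the training set after this epoch.
        losses.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, losses

# Hypothetical data generated by y = 2 * x.
w, losses = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], epochs=20)
```

Here the loss recorded after the last epoch is lower than after the first, and w ends up close to 2. On real data one would also track loss on held-out examples to spot overfitting.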

51
Q

Hyperparameters

A

Hyperparameters are configuration settings that are external to the model and are not learned from the data during training. They govern the learning process and affect the performance of the model. Examples include learning rate, regularization strength, and the number of hidden layers in a neural network.

A hyperparameter is a property of a learning algorithm, usually (but not always) having a numerical value. That value influences the way the algorithm works. Hyperparameters aren’t learned by the algorithm itself from data. They have to be set by the data analyst before running the algorithm.

52
Q

Layer

A

Layer refers to a collection of neurons (nodes) that perform a specific computation or transformation on input data. Neural networks typically consist of multiple layers organized in a hierarchical fashion, with each layer responsible for extracting different levels of abstraction from the input data. Common types of layers include input layer, hidden layers, and output layer.

Core Layers:
Dense (Fully Connected) Layers: The workhorse of many neural networks. Every neuron in a dense layer is connected to every neuron in the previous layer. Used for learning complex relationships between input features and for outputting final predictions.

Convolutional Layers: Designed to extract local patterns from data, especially images. They apply small filters that slide over the input, detecting features like edges and textures. Crucial for computer vision tasks.

Recurrent Layers (LSTM, GRU): Specialized for handling sequential data like text or time series. They maintain an internal memory to “remember” information from previous elements in the sequence, making them excellent for language modeling and tasks with temporal dependencies.

Normalization Layers:
Batch Normalization: Helps stabilize and speed up training by normalizing the activations of a layer across a batch of data. Reduces sensitivity to initialization and allows for higher learning rates.

Layer Normalization: Similar to batch normalization, but normalizes across the features within a single example, helpful for specific tasks like natural language processing.

Activation Layers:
ReLU (Rectified Linear Unit): Very popular due to its simplicity and its ability to mitigate the vanishing gradient problem. It simply outputs the input if it’s positive, otherwise outputs zero.

Sigmoid: Maps input values between 0 and 1, often used for output layers in binary classification problems (predicting probabilities).

Tanh: Similar to Sigmoid, but maps inputs between -1 and 1, sometimes helpful for certain tasks.

Pooling Layers:
Max Pooling: Downsamples feature maps by taking the maximum value within a sliding window, reducing dimensionality and making the network more robust to small data variations.

Average Pooling: Similar to max pooling, but takes the average within the window.

Other Specialized Layers:
Dropout: A regularization technique that randomly drops neurons during training to prevent overfitting.

Attention Mechanisms: Used in transformer architectures to allow the model to focus on important parts of the input sequence, crucial for advanced natural language processing tasks.
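
A few of the operations above are simple enough to write out directly. The sketch below covers the activation functions and a one-dimensional max pooling with non-overlapping windows; it works on plain numbers and lists rather than real network tensors.

```python
import math

def relu(x):
    """Output the input if it is positive, otherwise zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def max_pool_1d(values, window):
    """Take the maximum of each non-overlapping window (stride equals window)."""
    return [max(values[i:i + window]) for i in range(0, len(values) - window + 1, window)]

print(relu(-2.0), relu(3.0))               # 0.0 3.0
print(sigmoid(0.0), math.tanh(0.0))        # 0.5 0.0
print(max_pool_1d([1, 3, 2, 5, 4, 0], 2))  # [3, 5, 4]
```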

53
Q

Model

A

A mathematical representation of a system or process that learns patterns and relationships from data. It maps input features to output predictions and is trained using algorithms to optimize its performance on a specific task. Models can range from simple linear regression models to complex deep learning architectures.

54
Q

Neural Network

A

A neural network is a computational model inspired by the structure and function of the human brain. It consists of interconnected nodes (neurons) organized in layers. Neural networks can learn complex patterns and representations from data through training, making them effective for tasks such as classification, regression, and pattern recognition.

55
Q

PAC (Probably Approximately Correct) learning

A

A theoretical framework in machine learning that provides guarantees on the learnability of concepts from data. It defines conditions under which a learning algorithm can produce a hypothesis that is probably approximately correct with high confidence. PAC learning bounds the error of the learned hypothesis in terms of its performance on a sample of data drawn from the underlying distribution.
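
One concrete instance: for a finite hypothesis class H in the realizable setting (some hypothesis in H is perfectly correct), a standard PAC result says that any learner returning a hypothesis consistent with m >= (1/epsilon) * (ln|H| + ln(1/delta)) samples has true error at most epsilon with probability at least 1 - delta. The helper below just evaluates that bound; the function name and the numbers are illustrative.

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Samples sufficient for error <= epsilon with probability >= 1 - delta
    (finite hypothesis class, realizable case, consistent learner)."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

m = pac_sample_size(hypothesis_count=1000, epsilon=0.1, delta=0.05)
print(m)  # 100
```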

56
Q

Parameters

A

Parameters are variables that define the model learned by the learning algorithm. Parameters are directly modified by the learning algorithm based on the training data. The goal of learning is to find such values of parameters that make the model optimal in a certain sense.

In the context of machine learning models, parameters are the internal variables that the model learns from the training data. They represent the adjustable components of the model that determine its behavior and are typically optimized during the training process to minimize the objective function. Examples of parameters include weights in a neural network, coefficients in a linear regression model, and split points in a decision tree.

  • Weights: Numerical values associated with each input feature or connection between neurons in a neural network. They determine the influence of a feature on the model’s output.
  • Biases: Constant values added to the weighted sum of inputs before applying an activation function (especially in neural networks). They help shift the activation function to adjust the model’s output.
57
Q

Parametric model

A

A type of statistical model that makes strong assumptions about the functional form or distribution of the underlying data. It is characterized by a fixed number of parameters that are determined independently of the size of the training data. Parametric models are often simpler and more computationally efficient than non-parametric models but may make restrictive assumptions that limit their flexibility and applicability to different types of data. Examples include linear regression, logistic regression, and Gaussian Naive Bayes.

58
Q

Policy (in reinforcement learning)

A

A policy defines the agent’s strategy or behavior for selecting actions in a given state to maximize cumulative rewards. It maps states to actions and guides the agent’s decision-making process. Policies can be deterministic or stochastic, representing either a fixed action or a distribution over possible actions, respectively.
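
The difference between the two kinds of policy can be sketched with plain dictionaries. The states, actions, and probabilities below are invented for illustration, and no learning takes place.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"low_battery": "recharge", "charged": "explore"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "low_battery": {"recharge": 0.9, "explore": 0.1},
    "charged": {"recharge": 0.2, "explore": 0.8},
}

def sample_action(policy, state, rng):
    """Draw one action according to the policy's probabilities for this state."""
    actions = list(policy[state])
    weights = [policy[state][action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so repeated runs draw the same action
fixed_action = deterministic_policy["charged"]
sampled_action = sample_action(stochastic_policy, "charged", rng)
```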

59
Q

Regression

A

Regression is a supervised learning task where the goal is to predict continuous or real-valued output variables based on input features. It involves modeling the relationship between independent variables and dependent variables using statistical techniques. Common types of regression include linear regression and polynomial regression. Logistic regression, despite its name, is used for binary classification.

60
Q

Semi-Supervised Learning

A

A machine learning paradigm that combines labeled and unlabeled data for training predictive models. It leverages the availability of a small amount of labeled data and a larger pool of unlabeled data to improve model performance. Semi-supervised learning algorithms aim to exploit the underlying structure or relationships in the data to make predictions.

61
Q

Updating weights and biases

A
  1. Neural Network Fundamentals
    Data flows through the network. Each neuron calculates a weighted sum of its inputs, adds its bias, and applies an activation function to produce an output. The difference between the network’s output and the true target value is calculated using a loss function. The goal is to find weights and biases that minimize error (the difference between predicted and true targets).
  2. Backpropagation and Gradient Descent
    When we get the error, it is “backpropagated” through the network. The partial derivative of the loss function with respect to each weight and bias is calculated (this is how much changing that weight/bias impacts the total error). Backpropagation, the algorithm that updates weights in neural networks, is built on the chain rule. It allows us to break down how a change in a weight far back in the network ultimately affects the final output error. Gradients tell you the direction to change weights/biases to reduce error. Updates are proportional to the negative of the gradient (going downhill to reduce error).
  3. Understanding Partial Derivatives
    The core idea is that when we take a partial derivative with respect to one variable, we temporarily pretend all other variables are fixed numbers. In practice, we calculate partial derivatives using the same differentiation rules you use for ordinary single-variable derivatives. The trick is to treat other variables as if they were ordinary numbers. Imagine a multidimensional landscape (like a hilly terrain) representing your function. When taking a partial derivative with respect to ‘x’ we essentially slice a cross-section of this landscape only in the ‘x’ direction. We study how the function changes along that slice.
  4. The Update Process
    * Batching: The weights are usually updated after calculating the gradients over a batch of examples, for better stability and efficiency.
    * Update Formula: New Weight = Old Weight - (Learning Rate * Gradient)
    * Learning Rate’s Role: The learning rate scales the size of each update. Raw gradients can vary wildly in magnitude across different layers of the network, and even if the gradient’s direction is correct, moving too far in that direction might land us in an even worse place in terms of error.
  5. Example
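    Imagine a neuron with two inputs (x1, x2), weights (w1, w2), bias (b), and output (y):
    z = w1*x1 + w2*x2 + b
    y = activation(z)
    Let’s say the true target is ‘t’, and the error is (y-t)^2
    Gradients (by the chain rule):
    • dError/dw1 = 2 * (y-t) * activation'(z) * x1
    • dError/dw2 = 2 * (y-t) * activation'(z) * x2
    • dError/db = 2 * (y-t) * activation'(z)
      With an identity activation and the error defined as (1/2)(y-t)^2, these reduce to (y-t) * x1, (y-t) * x2, and (y-t).
      Adjust weights/bias by subtracting a small multiple of their corresponding gradients.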
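
A single update step for the neuron in this example can be sketched as follows. To keep it short, the sketch assumes an identity activation (so the derivative of the activation is 1) and uses the error (y-t)^2, whose derivative with respect to y is 2 * (y-t). The starting values and learning rate are arbitrary.

```python
def gradient_step(w1, w2, b, x1, x2, t, learning_rate):
    """One gradient-descent update for a single neuron with identity activation."""
    y = w1 * x1 + w2 * x2 + b   # forward pass
    d_error_dy = 2 * (y - t)    # derivative of (y - t)^2 with respect to y
    # Chain rule: dError/dw = dError/dy * dy/dw, and dy/dw1 = x1, dy/dw2 = x2, dy/db = 1.
    new_w1 = w1 - learning_rate * d_error_dy * x1
    new_w2 = w2 - learning_rate * d_error_dy * x2
    new_b = b - learning_rate * d_error_dy
    return new_w1, new_w2, new_b

def squared_error(w1, w2, b, x1, x2, t):
    return (w1 * x1 + w2 * x2 + b - t) ** 2

# Hypothetical starting point and training example.
w1, w2, b = 0.5, -0.5, 0.0
x1, x2, t = 1.0, 2.0, 1.0

error_before = squared_error(w1, w2, b, x1, x2, t)                   # 2.25
new_w1, new_w2, new_b = gradient_step(w1, w2, b, x1, x2, t, 0.1)
error_after = squared_error(new_w1, new_w2, new_b, x1, x2, t)        # about 0.09
```

The weights move to roughly 0.8, 0.1, and 0.3, and the error on this example drops from 2.25 to about 0.09.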
62
Q

curse of dimensionality

A

In general, when you have a fixed number of data points, each added dimension adds more and more space in which to place those points, essentially spreading them out and isolating them in a vast space. If your dataset has many features (columns), each feature is essentially a dimension. Increasing the number of dimensions adds space between points, making the distances large and hence not very informative. High dimensions make the distances larger and magnify differences. This may lead to the model finding patterns that are just noise.

Imagine a line segment in one dimension. If you want to sample enough points to reasonably represent it, you don’t need many points. Now consider two dimensions: a square. To get the same level of coverage as you had on the line, you need many more points, because the required number grows exponentially with the number of dimensions.
Scale up to three dimensions: a cube. The need for data points explodes even further. This pattern continues with each added dimension.
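
The growth described above is easy to put in numbers. Assuming we want 10 evenly spaced sample points along every axis (an arbitrary choice), the number of grid points needed is 10 raised to the number of dimensions:

```python
def grid_points_needed(points_per_axis, dimensions):
    """Number of points in a regular grid with the same resolution on every axis."""
    return points_per_axis ** dimensions

counts = {d: grid_points_needed(10, d) for d in (1, 2, 3, 10)}
print(counts)  # {1: 10, 2: 100, 3: 1000, 10: 10000000000}
```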

In high-dimensional spaces, data points become extremely sparse. It’s like a handful of sand scattered across a football field. The way we typically calculate distances (like Euclidean distance) starts to break down. In high dimensions, the distance between any two points tends to become similar, making it harder to distinguish meaningful relationships.
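
The effect on distances can be checked with a small experiment. The sketch below draws random points in a unit hypercube and measures the "relative contrast" between the farthest and nearest point from a random query. The point count, seed, and dimensions are arbitrary, and the exact values depend on them.

```python
import random

def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(sum((p - q) ** 2 for p, q in zip(point, query)) ** 0.5)
    return (max(distances) - min(distances)) / min(distances)

low_dim = relative_contrast(2)     # nearest and farthest points differ a lot
high_dim = relative_contrast(500)  # distances bunch together
```

In low dimensions the contrast is large (the nearest point is far closer than the farthest), while in hundreds of dimensions it shrinks toward zero, which is why nearest-neighbor style reasoning becomes less informative.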

Pixels in an image can be seen as dimensions. High-resolution images suffer particularly strongly. Representing words as vectors (e.g., word embeddings) can lead to very high-dimensional spaces.

Some algorithms (like k-nearest neighbors) are more sensitive to the curse of dimensionality than others. Techniques like Principal Component Analysis (PCA) or feature selection help reduce the number of dimensions before modeling. Regularization techniques, which prevent models from overfitting, are crucial in high-dimensional scenarios. With enough data you can sometimes overcome the curse, but the amount of data needed grows exponentially with the number of dimensions.

63
Q

Data-Oriented Programming

A

Programming paradigm —a style of building the structure and elements of computer programs—that focuses on organizing and manipulating data efficiently, often prioritizing performance and memory optimization over traditional software design considerations. It emphasizes data layout, data locality, and cache coherence to maximize computational efficiency, especially in applications with large datasets or high-performance requirements. Data-oriented programming is commonly used in game development, scientific computing, and parallel processing, where data processing speed and memory usage are critical.
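
The data layout idea can be hinted at even in Python. The sketch below contrasts an "array of structures" (one dictionary per particle) with a "structure of arrays" (one contiguous array per field), which is the layout data-oriented designs tend to prefer. The particle data is invented, and Python will not show the cache benefits that a compiled language would.

```python
from array import array

# Array of structures: each particle is its own object scattered in memory.
particles_aos = [
    {"x": 0.0, "vx": 1.0},
    {"x": 1.0, "vx": 0.5},
    {"x": 2.0, "vx": -1.0},
]

# Structure of arrays: each field is stored contiguously as packed doubles.
xs = array("d", [0.0, 1.0, 2.0])
vxs = array("d", [1.0, 0.5, -1.0])

def step(xs, vxs, dt):
    """Advance every position in place; the loop touches only the two arrays it needs."""
    for i in range(len(xs)):
        xs[i] += vxs[i] * dt

step(xs, vxs, dt=2.0)
print(list(xs))  # [2.0, 2.0, 0.0]
```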

64
Q

Declarative Programming

A

A programming paradigm (a style of building the structure and elements of computer programs) that emphasizes expressing what should be accomplished rather than how it should be achieved. Put another way, it is a paradigm that expresses the logic of a computation without describing its control flow. It focuses on declaring the desired results or properties of a computation, leaving the details of execution to the underlying system. Declarative programming languages and frameworks enable concise and expressive code, promote modularization and abstraction, and facilitate reasoning about programs. Examples of declarative programming paradigms include functional programming, logic programming, and database query languages.
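
As a small illustration of "what" versus "how", the two snippets below compute the same result. The first spells out the control flow step by step; the second, a list comprehension, reads closer to a description of the desired result. The order data is made up, and a comprehension is only a mildly declarative example compared with, say, a SQL query.

```python
orders = [
    {"customer": "ann", "total": 30},
    {"customer": "bob", "total": 80},
    {"customer": "cy", "total": 120},
]

# Imperative style: describe how to build the result, one step at a time.
big_spenders_imperative = []
for order in orders:
    if order["total"] > 50:
        big_spenders_imperative.append(order["customer"])

# More declarative style: describe what the result contains.
big_spenders_declarative = [o["customer"] for o in orders if o["total"] > 50]

print(big_spenders_declarative)  # ['bob', 'cy']
```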

65
Q

Dimensionality reduction

A

A process of reducing the number of features or variables in a dataset while preserving its essential information and structure. It aims to overcome the curse of dimensionality, improve computational efficiency, and enhance model interpretability and generalization. Dimensionality reduction techniques include feature selection, feature extraction methods such as Principal Component Analysis (PCA) and autoencoders, and manifold learning methods such as t-distributed Stochastic Neighbor Embedding (t-SNE). Dimensionality reduction is widely used in machine learning and data analysis for visualization, preprocessing, and model training.

66
Q

Few-shot learning

A

Machine learning paradigm focused on rapidly adapting to new tasks with minimal labeled examples. In traditional supervised learning, models need massive datasets to learn patterns. FSL challenges this, aiming to mimic the human ability to learn a concept from just a few examples. It often utilizes pre-trained models fine-tuned on small datasets with these core ideas:

Meta-Learning: FSL models “learn to learn.” They are trained on various similar tasks, allowing them to quickly generalize to new, unseen tasks.
Metric Learning: Techniques are used to compare examples, learning a similarity function. A new example is classified based on its similarity to a small set of labeled examples (“the few shots”).
Applications: FSL is invaluable in domains with limited data, such as rare disease classification or new language translation.

Overall, few-shot learning pushes the boundaries of machine learning, working towards AI that can adapt and learn with more efficiency and flexibility.
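
The metric-learning idea can be sketched as nearest-prototype classification: average the few labeled examples of each class into a prototype, then assign a new example to the closest prototype. In practice the comparison happens in a learned embedding space; here raw 2-D feature vectors (invented for the example) stand in for embeddings.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# The "few shots": two labeled examples per class (hypothetical feature vectors).
support_set = {
    "cat": [[1.0, 1.0], [1.2, 0.8]],
    "dog": [[4.0, 4.0], [3.8, 4.2]],
}

# One prototype per class, computed from its few examples.
prototypes = {label: centroid(examples) for label, examples in support_set.items()}

def classify(query):
    """Assign the query to the class with the nearest prototype."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

prediction = classify([3.5, 3.9])
print(prediction)  # dog
```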

67
Q

Lazy

A

Lazy learning models don’t create a complex model or generalization from the training data right away. Instead, they store the training data and wait until they need to make a prediction.

When you ask a lazy learner a question (give it a new data point), it quickly analyzes the stored training data, focusing only on the examples most similar to your question.
Since they don’t build a model upfront, lazy learners train very quickly. They can easily handle changing data, as they don’t have a rigid, pre-built model to update.

Imagine you have a bunch of labeled pictures of cats and dogs. A lazy learner wouldn’t bother learning general “cat” or “dog” features. When shown a new picture, it would find the most similar pictures in its stored memory and predict based on them.

Lazy learning is a type of learning algorithm that postpones the processing of training data until it receives a query for prediction. These algorithms do not build an explicit model during the training phase but rather store the training instances and their associated labels or values in memory. When a prediction is requested for a new data point, the algorithm searches through the stored training instances to find the most similar ones and makes predictions based on their characteristics.

The main characteristic of lazy learning algorithms is that they delay generalization until the prediction phase. This contrasts with eager learning algorithms, which build a generalization model from the training data upfront and use it to make predictions without requiring access to the original training instances.

Advantages of lazy learning algorithms include:

Adaptability to complex data: Lazy learning algorithms can handle complex data structures without needing to define an explicit model during training, although distance-based methods such as k-NN lose accuracy in very high-dimensional feature spaces (the curse of dimensionality).
Flexibility in handling non-linear relationships: Lazy learning algorithms can capture non-linear relationships between features and target variables without the need for assumptions about the data distribution.
Incremental learning: Lazy learning algorithms can easily adapt to new data without the need for retraining the entire model, making them suitable for scenarios where the underlying data distribution may change over time.
Interpretability: Since lazy learning algorithms rely on storing and retrieving training instances for making predictions, their decision-making process can be more interpretable compared to eager learning algorithms that use complex models.

However, lazy learning algorithms also have some drawbacks:

Computational inefficiency: Making predictions with lazy learning algorithms can be computationally expensive, especially for large datasets, because a naive implementation searches through the entire training dataset for each prediction. Index structures such as k-d trees or ball trees can reduce this cost.
Sensitivity to noise and irrelevant features: Lazy learning algorithms may be sensitive to noise and irrelevant features in the training data, as they rely on similarity measures to make predictions.
Memory requirements: Storing the entire training dataset in memory can require significant memory resources, particularly for large datasets with many instances.

Overall, lazy learning algorithms offer a flexible and adaptable approach to machine learning, particularly in scenarios where the underlying data distribution is complex or changes over time. However, their computational costs and memory requirements need to be carefully considered when applying them to real-world problems.

Examples of lazy learning algorithms include:

k-Nearest Neighbors (k-NN): One of the most well-known lazy learning algorithms. In k-NN, predictions are made based on the majority class (for classification) or the average target value (for regression) among the k nearest neighbors to the query point in the feature space.
Radius Neighbors: Similar to k-NN, but instead of considering a fixed number of neighbors (k), it considers all neighbors within a specified radius around the query point.
Local Outlier Factor (LOF): An unsupervised algorithm for outlier detection. LOF measures the local density deviation of a data point with respect to its neighbors, identifying points with significantly lower density as outliers.
Kernel Density Estimation (KDE): Estimates the probability density function of a continuous random variable based on the local densities of data points around a query point. KDE is often used for density estimation and anomaly detection tasks.
Locally Weighted Regression (LWR): A non-parametric regression algorithm that estimates the value of a target variable for a given query point by fitting a weighted linear regression around that point, where the weight of each training instance decreases with its distance from the query point.
Case-Based Reasoning (CBR): A problem-solving paradigm where new problems are solved by adapting solutions to similar past problems. CBR stores a database of past cases (training instances) and retrieves the most similar cases to the current problem for making predictions or decisions.
Prototype-Based Classification: Classifies new instances by comparing them to a set of prototype instances stored in memory. Similarity measures (e.g., Euclidean distance, cosine similarity) are used to determine the most similar prototypes to the query instance, and predictions are made based on the labels of these prototypes.
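
The k-NN description above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `knn_predict`, the toy points, and the choice of Euclidean distance are all assumptions made for the example. Note that there is no training step at all; the stored data is scanned only when a query arrives.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storage; all the work happens here, at query time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict(points, labels, (1.1, 1.0)))  # the three nearest points are all "cat"
print(knn_predict(points, labels, (5.1, 5.0)))  # the three nearest points are all "dog"
```

Adding a new labeled example only requires appending to `points` and `labels`, which is the incremental-learning advantage listed above. The full scan per query is the computational cost listed among the drawbacks.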

68
Q

Neural Network Weight Decay

A

Neural network weight decay is a regularization technique aimed at preventing overfitting and improving the generalization performance of neural networks. Here’s how it works:

The Problem of Overfitting: Complex neural networks can often memorize training data perfectly, failing to generalize well to unseen examples.
Intuition: Weight decay encourages smaller weights in the model. This preference for smaller weights leads to simpler and smoother model functions, which tend to generalize better.
How It’s Done: A regularization term, typically an L2 penalty proportional to the sum of the squared weights, is added to the loss function during training. This term penalizes large weights, effectively shrinking them towards zero at each update step.
Hyperparameter: The strength of weight decay is controlled by a hyperparameter (the weight decay coefficient, often written as λ). Finding the optimal value often involves experimentation.

In essence, weight decay acts like a gentle push towards simpler models while still allowing the network to learn the necessary representations from the data.
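
A minimal sketch of how the penalty changes a gradient step, assuming plain SGD and an L2 penalty of the form (weight_decay / 2) * sum(w²). The function name `sgd_step` and all numeric values are hypothetical, chosen only to make the shrinking effect visible.

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD update with an L2 penalty (weight_decay / 2) * sum(w**2) added to the loss."""
    # The gradient of the penalty with respect to each weight is weight_decay * w,
    # so every step scales w by (1 - lr * weight_decay) in addition to
    # applying the usual data gradient g.
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# With a zero data gradient, the weights simply decay toward zero.
w = [1.0, -2.0]
for _ in range(3):
    w = sgd_step(w, grads=[0.0, 0.0], lr=0.1, weight_decay=0.5)
print(w)  # each weight has been multiplied by 0.95 three times
```

For plain SGD this is equivalent to shrinking the weights directly at each step. With adaptive optimizers the two formulations can behave differently, which is why some optimizers apply the decay separately from the gradient.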

69
Q

Q-learning

A

Q-learning is a model-free reinforcement learning algorithm in which an agent learns to make optimal decisions by interacting with an environment and learning from the consequences of its actions. The core idea is to build a “Q-table” that stores the expected future reward (Q-value) for taking each possible action in each state of the environment. After every action, the agent updates the corresponding entry, moving it toward the reward it received plus the discounted value of the best action available in the next state. Over time, it learns to choose actions that maximize its long-term expected reward, even if some actions don’t lead to immediate rewards. For example, a Q-learning robot in a maze can learn the best path to a reward by trying different routes, observing the outcomes, and updating its understanding of which moves lead to success.
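
The maze example can be reduced to a toy sketch: a one-dimensional corridor of five cells with a reward in the last one. Everything here is an assumption for illustration (the environment function `step`, the hyperparameter values, and the epsilon-greedy exploration scheme); only the update rule itself is the standard tabular Q-learning update.

```python
import random

N_STATES = 5      # corridor cells 0..4; cell 4 holds the reward
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical environment: deterministic moves, reward 1 on reaching the last cell."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q-table: q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                best = max(q[state])          # exploit, breaking ties at random
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Move Q(s, a) toward: reward + gamma * max over a' of Q(s', a')
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q_table = train()
# Greedy policy for the non-terminal cells: the action with the highest Q-value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)  # 1 means "move right", i.e. toward the reward
```

Only the final move is rewarded, yet the earlier cells still learn to prefer moving right because the discounted value of the next cell propagates backward through the table.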

70
Q

Sparsity in the activations.

A

Sparsity in the activations occurs when a significant portion of the neurons in a neural network layer have output values of zero or close to zero. It is the opposite of dense activation, where most or all neurons in a layer have significant outputs.

Why Sparsity Matters: Processing zeroes is faster. Sparse activations can lead to faster computations in neural networks. During calculations (like matrix multiplications) within a neural network, zero values don’t contribute anything. Software and hardware can be optimized to skip multiplications by zero, which can significantly speed up computations. On parallel hardware such as GPUs, these gains usually depend on structured sparsity patterns or specialized sparse kernels, since irregularly placed zeros are hard to skip efficiently.
Sparse representations can be stored more compactly. Instead of storing a full, dense matrix of activation values, you can store only the non-zero elements and their positions, which can significantly reduce the memory footprint. Sparse models trained with appropriate techniques may also require fewer neurons overall to achieve similar performance, leading to smaller model files.
Some research suggests sparsity can act as a form of regularization, helping prevent overfitting. Sparsity can be seen as forcing the model to rely on a smaller subset of features or connections. This inherently limits the complexity of the model, making it less prone to memorizing noise in the training data.

ReLU (Rectified Linear Unit): Naturally encourages sparsity (outputs zero for negative inputs).
Image Processing: Sparse representations can highlight important image features efficiently.
Natural Language Processing: Sparse activations can help models focus on relevant words or phrases in text.
Compressive Sensing: Sparsity is fundamental in techniques that reconstruct signals from fewer samples.

Too much sparsity can hinder the network’s ability to learn complex patterns. Finding the optimal balance is important.
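
As a rough illustration of the points above, the following sketch applies ReLU to a handful of made-up pre-activation values, measures the resulting sparsity, and stores only the non-zero outputs with their positions. The helper names `sparsity` and `to_sparse` are hypothetical.

```python
def relu(values):
    """ReLU outputs zero for every negative input."""
    return [max(0.0, v) for v in values]

def sparsity(activations, tol=0.0):
    """Fraction of activations whose magnitude is at most `tol`."""
    return sum(1 for a in activations if abs(a) <= tol) / len(activations)

def to_sparse(activations):
    """Compact storage: keep only (index, value) pairs for non-zero entries."""
    return [(i, a) for i, a in enumerate(activations) if a != 0.0]

pre_activations = [-1.5, 0.3, -0.2, 2.0, -0.7, -3.1, 0.0, 1.1]  # made-up values
acts = relu(pre_activations)

print(sparsity(acts))   # 5 of 8 outputs are zero -> 0.625
print(to_sparse(acts))  # [(1, 0.3), (3, 2.0), (7, 1.1)]
```

Raising `tol` above zero counts near-zero activations as well, which matches the "zero or close to zero" definition on this card.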

71
Q

Pipeline (Problem Definition and Data Collection)

A
  1. Existing Data Sources:
    Internal Databases: Leverage company-owned databases containing customer data, transactions, product information, etc.
    Open Datasets: Explore publicly available datasets (e.g., Kaggle, UCI Machine Learning Repository, government data portals).
    Third-Party Data Providers: Purchase data from vendors specializing in specific domains (finance, healthcare, etc.).
  2. Web Scraping
    Automated Web Extraction: Develop scripts or use specialized tools to extract structured data from websites, such as product information, reviews, or social media posts.
  3. Application Programming Interfaces (APIs)
    Third-Party APIs: Access real-time data from social media platforms, weather services, financial data providers, and more.
    Creating Your Own API: If you have internal data, creating an API can make it easily accessible for machine learning projects.
  4. Surveys and Forms
    Online Surveys: Use survey platforms to collect structured data on opinions, behaviors, or preferences.
    Manual Data Entry: For small amounts of data or specific information, manual input into forms or spreadsheets may be necessary.
  5. Sensor Data
    IoT Devices: Collect data from sensors measuring temperature, location, motion, and other environmental variables.
    Wearables: Smartwatches and fitness trackers collect health and activity data.
  6. Experiments
    A/B Testing: Compare the effects of different treatments or stimuli on users or systems, collecting data on their responses.
    Controlled Studies: Design experiments to isolate specific variables and understand their impact on a target outcome.
72
Q

Pipeline (Data Preprocessing & Exploration)

A

This process is iterative. Findings in EDA might require revisiting cleaning or feature engineering.
The choice of techniques depends heavily on your dataset’s nature and the machine learning task at hand.

Data Cleaning:
* Handling Missing Values:
  * Removal of rows or columns (if feasible)
  * Imputation (mean, median, mode, predictive modeling)
* Addressing Outliers and Inconsistent Data:
  * Identification (box plots, scatter plots, statistical methods)
  * Capping/flooring values
  * Removal (if justified)
  * Transformations (e.g., log transformation)
* Formatting:
  * Data type conversions (numeric, string, date/time)
  * Text preprocessing (tokenization, stemming, lemmatization, stop word removal)
* Reshaping (if needed):
  * Pivoting tables
  * Transposing data
  * Splitting or combining columns
  * Changing the dimensions of tensors for compatibility with various model architectures (e.g., convolutional neural networks, recurrent neural networks)
* Resampling:
  * Undersampling (majority class)
  * Oversampling (minority class)
  * Synthetic data generation (e.g., SMOTE)
  * Stratified sampling (to maintain class proportions in a smaller sample)
* Data Integration:
  * Schema matching and merging
  * Entity resolution (identifying records that refer to the same entity)
* Data Augmentation (especially relevant for image, audio, or signal processing):
  * Image transformations (rotations, flipping, cropping, color adjustments)
  * Audio transformations (adding noise, time stretching, pitch shifting)
  * Text augmentation (synonym replacement, back-translation)
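
Two of the cleaning steps listed above, median imputation and capping/flooring of outliers, can be sketched with the standard library alone. The column values, the helper names, and the choice of the 1.5 × IQR fence are assumptions for illustration; real projects would often use a dataframe library instead.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def cap_outliers_iqr(values, k=1.5):
    """Capping/flooring: clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

ages = [23, 25, None, 31, 29, None, 27, 120]  # made-up column: two gaps, one outlier

filled = impute_median(ages)       # the gaps become the median of the observed ages
capped = cap_outliers_iqr(filled)  # 120 is clipped to the upper fence

print(filled)
print(capped)
```

Imputing before capping keeps the quartile calculation free of missing values. Whether to cap, remove, or transform an outlier remains a judgment call that depends on the dataset.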

Exploratory Data Analysis (EDA):
* Statistical Summaries:
  * Measures of central tendency (mean, median, mode)
  * Measures of spread (variance, standard deviation, quartiles)
* Visualizations:
  * Histograms
  * Box plots
  * Scatter plots
  * Correlation matrices (heatmap)
* Time Series Analysis:
  * Analyzing trends and seasonality
  * Decomposing time series into components (trend, seasonality, residuals)
* Clustering:
  * K-means clustering
  * Hierarchical clustering
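
The statistical summaries and the correlation entry above can be computed without any plotting, as in this minimal sketch. The two samples are made up, and `pearson` is a hand-written helper (an assumption for this example) implementing the usual Pearson correlation coefficient.

```python
import math
import statistics

heights = [150, 160, 165, 170, 180]  # made-up sample
weights = [50, 58, 63, 68, 80]       # made-up sample

summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "stdev": statistics.stdev(heights),               # sample standard deviation
    "quartiles": statistics.quantiles(heights, n=4),  # three cut points: Q1, Q2, Q3
}

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

print(summary)
print(pearson(heights, weights))  # close to 1: the two columns rise together
```

A correlation matrix heatmap is this same coefficient computed for every pair of numeric columns.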

Feature Engineering:
* Feature Creation:
  * Deriving new features through calculations or combinations of existing features
  * Domain-specific transformations
  * Interaction features (representing interactions between two or more existing features)
* Feature Selection:
  * Filter methods (e.g., correlation analysis, chi-squared test)
  * Wrapper methods (e.g., recursive feature elimination)
  * Embedded methods (e.g., decision trees, L1 regularization)
* Dimensionality Reduction:
  * Principal Component Analysis (PCA)
  * Linear Discriminant Analysis (LDA)
  * t-SNE (primarily for visualization)
  * Autoencoders
* Feature Transformation:
  * Feature scaling (normalization, standardization)
  * Encoding categorical features (label encoding, one-hot encoding)
  * Feature binning (discretization)
  * Text representation techniques:
    * TF-IDF (Term Frequency-Inverse Document Frequency)
    * Word embeddings (e.g., Word2Vec, GloVe)
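
A minimal sketch of three of the feature transformations listed above: standardization, min-max normalization, and one-hot encoding. The helper names and the sample values are assumptions for illustration. In practice the scaling statistics should be computed on the training split only and then reused on validation and test data.

```python
import statistics

def standardize(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def min_max(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """One-hot encoding: one binary column per distinct category."""
    vocab = sorted(set(categories))
    return vocab, [[1 if c == v else 0 for v in vocab] for c in categories]

incomes = [30.0, 45.0, 60.0, 75.0, 90.0]  # made-up numeric feature
colors = ["red", "green", "red", "blue"]  # made-up categorical feature

z_scores = standardize(incomes)           # mean 0, standard deviation 1
print(min_max(incomes))                   # [0.0, 0.25, 0.5, 0.75, 1.0]
vocab, encoded = one_hot(colors)
print(vocab)    # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Label encoding would instead map each category to a single integer, which is more compact but implies an ordering that one-hot encoding avoids.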