Build, Train and Tune Model Flashcards
(92 cards)
Activation Function
A mathematical function applied to the output of each neuron in a neural network to introduce non-linearity and enable the network to learn complex patterns and relationships in the data. Common activation functions include sigmoid, tanh, ReLU (Rectified Linear Unit), Leaky ReLU, and softmax. Activation functions play a crucial role in determining the output of neural networks and affect the network’s training dynamics, convergence speed, and performance.
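Example: the functions named above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the function names and the default `negative_slope` of 0.01 are assumptions chosen for the example.

```python
import math

def sigmoid(x: float) -> float:
    """Squash x into the range (0, 1); branches avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x: float) -> float:
    """Squash x into the range (-1, 1)."""
    return math.tanh(x)

def relu(x: float) -> float:
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Like ReLU, but negatives keep a small slope (assumed default 0.01)."""
    return x if x > 0 else negative_slope * x

def softmax(xs: list[float]) -> list[float]:
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

Sigmoid, tanh, ReLU and Leaky ReLU act on one value at a time, while softmax acts on a whole vector of scores, which is why it is typically used on the output layer of a classifier.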
Activation map
Also known as a feature map, an activation map is a two-dimensional array or tensor that represents the output of a layer in a convolutional neural network (CNN) after applying an activation function. Each element in the activation map corresponds to the activation value of a specific neuron in the layer, capturing the presence of certain features or patterns in the input data. Activation maps are used for visualizing and interpreting the learned representations in CNNs and are instrumental in understanding how the network processes and transforms input data.
Adam Optimization
Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used to update the parameters of neural networks during training. It combines the benefits of adaptive learning rate methods (such as RMSprop) and momentum-based optimization techniques, and in practice it often converges quickly with little tuning of the learning rate. Adam computes adaptive learning rates for each parameter from exponentially decaying averages of past gradients and squared gradients. It is widely used in deep learning frameworks for training various types of neural network architectures.
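Example: a minimal sketch of one Adam update, assuming the commonly used defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`) and bias-corrected moment estimates. The function name `adam_step` and the toy objective are assumptions for illustration only.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based step counter."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # decaying average of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # decaying average of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter step size
    return params

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 1001):
    grad = [2.0 * (w[0] - 3.0)]
    adam_step(w, grad, m, v, t, lr=0.05)
```

Dividing by the square root of the squared-gradient average is what makes the step size adaptive per parameter: parameters with consistently large gradients take proportionally smaller steps.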
Backpropagation
A fundamental algorithm for training neural networks. It involves computing the gradient of a loss function with respect to the network’s parameters, then using this gradient to update the parameters in the direction that minimizes the loss. This process is repeated iteratively to optimize the network’s performance. Backpropagation enables neural networks to learn from data by adjusting their internal parameters to better approximate the desired output for a given input. For example, in a simple feedforward neural network used for image classification, backpropagation adjusts the weights connecting neurons in each layer to reduce the difference between the predicted class and the actual class of each image in the training dataset.
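Example: a minimal sketch of backpropagation on a tiny two-weight network, `y_hat = w2 * relu(w1 * x)`, with a squared-error loss. The network shape, the starting weights, and the learning rate of 0.05 are assumptions chosen to keep the chain rule visible.

```python
def forward_backward(w1, w2, x, y):
    """Return the loss and its gradients with respect to w1 and w2."""
    # Forward pass
    z = w1 * x
    h = max(0.0, z)              # ReLU activation
    y_hat = w2 * h
    loss = (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the loss back to each weight
    d_y_hat = 2.0 * (y_hat - y)
    d_w2 = d_y_hat * h
    d_h = d_y_hat * w2
    d_z = d_h * (1.0 if z > 0 else 0.0)   # derivative of ReLU
    d_w1 = d_z * x
    return loss, d_w1, d_w2

# Gradient descent: repeatedly step against the gradient to reduce the loss.
w1, w2 = 0.5, 0.3
for _ in range(200):
    loss, g1, g2 = forward_backward(w1, w2, x=2.0, y=1.0)
    w1 -= 0.05 * g1
    w2 -= 0.05 * g2
```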
Backpropagation through time (BPTT)
An extension of the backpropagation algorithm specifically designed for training recurrent neural networks (RNNs) over sequential data. It unfolds the network through time, treating each time step as a layer, and computes gradients using the chain rule of calculus. BPTT is widely used in tasks such as speech recognition, natural language processing, and time series prediction, where the input data is sequential and has temporal dependencies. However, BPTT suffers from the vanishing gradient problem, where gradients diminish exponentially over long sequences, making it challenging to learn dependencies over extended periods.
Batch Normalization
A technique used to improve the training of deep neural networks by normalizing the inputs of each layer across mini-batches of data. It was proposed as a way to reduce internal covariate shift, and it accelerates convergence by stabilizing the distributions of layer inputs. Batch Normalization helps mitigate issues such as vanishing gradients, enables the use of higher learning rates, and acts as a regularizer, reducing the need for other regularization techniques. In its original formulation it is applied to the output of a layer’s linear or convolutional transform, before the activation function, although some architectures place it after the activation; in both cases the normalized values are then scaled and shifted by learned parameters before being passed on.
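Example: a minimal sketch of the training-time computation for a single feature, assuming the usual learned scale `gamma` and shift `beta` (fixed here at 1 and 0) and a small `eps` for numerical stability. Running statistics for inference are left out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n   # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```

After normalization the values in the mini-batch have approximately zero mean and unit variance, whatever the scale of the original inputs.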
Bayesian hyperparameter optimization
A method used to efficiently search for optimal hyperparameters of machine learning algorithms by approximating the objective function with a probabilistic surrogate model. It iteratively updates this surrogate based on observed evaluations and uses it to choose which hyperparameters to evaluate next, allowing for more effective exploration of the hyperparameter space.
Purpose: The goal of Bayesian hyperparameter optimization is to find hyperparameters that maximize the performance of a machine learning model while minimizing the number of evaluations required.
Example: In practice, Bayesian optimization is used in tasks such as tuning the hyperparameters of support vector machines, random forests, and deep neural networks, where manually searching the hyperparameter space would be prohibitively time-consuming.
Bootstrapping
A resampling technique used in statistics and machine learning to estimate the sampling distribution of a statistic by repeatedly sampling with replacement from the observed data. In bootstrapping, multiple samples (bootstrap samples) are drawn from the original dataset, and statistical estimates or models are computed on each sample. The variation of these estimates across the bootstrap samples approximates the sampling distribution of the statistic.
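Example: a minimal sketch that bootstraps the mean of a small dataset. The data values, the number of resamples (1000), and the fixed seed are assumptions for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
means = bootstrap_means(data)

# The spread of the bootstrap means estimates the standard error of the mean.
std_error = statistics.stdev(means)
```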
Bottleneck layer
Bottleneck layers in neural networks compress the flow of information. Imagine a large crowd trying to squeeze through a narrow tunnel. The bottleneck layer is that tunnel, forcing the network to compress its data into a lower-dimensional representation.
A bottleneck layer has fewer neurons than the layers before and after it. This “bottleneck” forces the network to identify the most critical information and discard redundancy. Despite the size reduction, the bottleneck layer aims to capture the essence of the data by keeping the most significant features learned by previous layers. By reducing data size, bottleneck layers make the network more efficient: they require fewer calculations and can help prevent overfitting, a situation where the network memorizes specifics instead of learning general patterns. Bottleneck layers are often used in conjunction with residual connections, which let the input bypass the bottleneck and be added to its output. This ensures the network retains important details while still benefiting from the efficiency gains.
Bottleneck layers are used, for example, in autoencoders.
Bounding box
A rectangular or cuboidal area used to encapsulate objects or regions of interest in an image or scene. It is defined by its coordinates, typically represented as (xmin, ymin, xmax, ymax) for 2D bounding boxes in image space. Bounding boxes are commonly used in computer vision tasks such as object detection, instance segmentation, and object tracking to localize and identify objects within images or video frames.
Checkpoints in models
Checkpoints are snapshots of a model’s parameters saved during training. These checkpoints include the model’s architecture, weights, optimizer state, and other relevant parameters. Checkpoints are crucial for resuming training from a specific point, fine-tuning models, or deploying trained models for inference. They allow practitioners to monitor training progress, prevent data loss in case of interruptions, and facilitate model evaluation and experimentation.
Checkpoints are saved in framework-specific formats (e.g., TensorFlow’s SavedModel, PyTorch’s .pth files) and managed using tools like callbacks in TensorFlow or torch.save() in PyTorch. They support reproducibility, scalability, and reliability in machine learning workflows.
Convolving
Convolving refers to the process of applying a convolution operation to input data using a convolutional kernel or filter. In the context of image processing and computer vision, convolution is used to extract features from images by sliding a kernel over the input image and computing the dot product between the kernel and local regions of the image. Convolving is a fundamental operation in convolutional neural networks (CNNs) and is used to detect patterns, edges, textures, and other visual features in images.
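Example: a minimal sketch of sliding a kernel over an image and taking the dot product at each position. It assumes no padding and a stride of 1, and it does not flip the kernel, which matches what most CNN libraries compute. The image and kernel values are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and return the resulting feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the local image region
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# A dark-to-bright vertical edge in the middle of a tiny image
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge_kernel)
```

The feature map is non-zero only in the column where the kernel straddles the edge, which is how a filter signals that its pattern is present at a location.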
Criterion
An objective function or measure used to evaluate the performance of a model or algorithm. The criterion quantifies how well the model’s predictions match the true outcomes or how effectively the algorithm achieves its objectives. Common criteria in machine learning include loss functions, accuracy, precision, recall, F1-score, and mean squared error. The choice of criterion depends on the specific task, dataset, and optimization goals.
Data Loader
A component or module in machine learning frameworks and libraries used to load, preprocess, and batch input data for training or inference. Data loaders are responsible for reading data from storage (e.g., disk, database), applying data transformations (e.g., normalization, augmentation), and organizing data into batches suitable for efficient processing by machine learning models. Data loaders play a critical role in managing large datasets, handling data pipelines, and optimizing the training process for deep learning models.
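Example: a minimal sketch of a data loader as a Python generator. The function name `data_loader` and its parameters are hypothetical and only mirror the three responsibilities described above (reading, transforming, batching); framework loaders offer much more, such as parallel loading.

```python
import random

def data_loader(samples, batch_size, shuffle=True, transform=None, seed=None):
    """Yield batches of samples, optionally shuffled and transformed."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = [samples[i] for i in indices[start:start + batch_size]]
        if transform is not None:
            batch = [transform(sample) for sample in batch]
        yield batch

# Ten samples, scaled to [0, 1) and grouped into batches of four
batches = list(data_loader(list(range(10)), batch_size=4, shuffle=False,
                           transform=lambda s: s / 10))
```

The last batch is smaller than the others because ten samples do not divide evenly into batches of four.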
Decision boundary
A hypersurface or boundary that separates different classes or categories in the feature space of a classification problem. It represents the region where the decision function changes from predicting one class to another. In binary classification tasks, the decision boundary is typically a line, curve, or hyperplane that partitions the feature space into two regions corresponding to different class labels. Decision boundaries are learned by machine learning algorithms based on the training data and model parameters and are used to make predictions on new or unseen data points.
Denoising autoencoder
A type of artificial neural network used for learning efficient representations of data by removing noise or corruption from input samples. Unlike traditional autoencoders, denoising autoencoders are trained to reconstruct clean or uncorrupted versions of input data from noisy observations. They learn robust features that capture the underlying structure of the data while filtering out irrelevant or noisy information. Denoising autoencoders find applications in dimensionality reduction, feature learning, and unsupervised pretraining in machine learning and deep learning.
Dropout
A regularization technique used in neural networks to prevent overfitting and improve generalization performance. During training, dropout randomly deactivates (sets to zero) a proportion of neurons in a layer with a specified dropout rate. This prevents individual neurons from relying too heavily on specific features or co-adapting with other neurons and encourages the network to learn more robust and generalizable representations. Dropout is commonly used in deep learning models, especially fully connected and convolutional neural networks.
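Example: a minimal sketch of dropout on a list of activations. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at inference time. The function name and parameters are chosen for illustration.

```python
import random

def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)          # dropout is disabled at inference time
    rng = rng or random.Random()
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

dropped = dropout([1.0] * 1000, rate=0.5, rng=random.Random(0))
```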
EOS Token
End of sequence (often read as end of sentence). The EOS token acts like a full stop in a sentence: it is a special symbol that signals the end of an output sequence.
During training, the model learns to associate the EOS token with the end of a coherent sentence or translation. When generating text, the model keeps producing words or tokens until it outputs the EOS token. The EOS token is crucial because it allows models to generate sequences of different lengths. Without it, the model wouldn’t have a clear signal for when to stop generating text. In machine translation, an EOS token at the end of the source sentence tells the model that the input is complete, and the decoder (the part generating the target language) emits its own EOS token when the translation is finished.
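Example: a minimal sketch of a generation loop that stops at the EOS token. The `"<eos>"` string, the `generate` function, and the canned `toy_next_token` "model" are all hypothetical stand-ins for a real tokenizer and trained model.

```python
EOS = "<eos>"

def generate(next_token, prompt, max_new_tokens=20):
    """Append tokens until the model emits EOS or the length limit is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == EOS:
            break                     # the model signalled the end of the sequence
        tokens.append(token)
    return tokens

# Hypothetical model: replays a canned sequence that contains an EOS token.
CANNED = ["the", "cat", "sat", EOS, "ignored"]

def toy_next_token(tokens):
    return CANNED[len(tokens)] if len(tokens) < len(CANNED) else EOS

sentence = generate(toy_next_token, prompt=[])
```

The `max_new_tokens` limit is a safety net: a model that never emits EOS would otherwise generate forever.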
Evolutionary hyperparameter optimization techniques
Methods inspired by principles of natural selection and evolution to search for optimal hyperparameters in machine learning models. These techniques typically involve the use of evolutionary algorithms, such as genetic algorithms, evolutionary strategies, or genetic programming, to explore the hyperparameter space and find combinations that result in improved model performance. Evolutionary hyperparameter optimization techniques are useful when dealing with complex optimization problems or when traditional methods such as grid search or random search are impractical or inefficient.
Mechanism, inspired by natural selection:
- Population of Solutions: Instead of manually trying different hyperparameter combinations, you start with a population of random solutions (sets of hyperparameters).
- Fitness Evaluation: Each solution is evaluated, often by training a model with those hyperparameters and seeing how it performs on a validation set.
- Survival of the Fittest: The best-performing solutions have a higher chance of being selected for the next generation.
- Crossover and Mutation: New solutions are created by:
  - Crossover: Combining elements from two good parent solutions.
  - Mutation: Making small random changes to existing solutions. This helps explore the search space.
- Repeat: This process repeats for several generations, with the aim that better solutions evolve over time.
Examples of evolutionary and related population-based algorithms
- Genetic Algorithms (GA): Solutions are represented like chromosomes; crossover and mutation are modeled after biological processes.
- Particle Swarm Optimization (PSO): Solutions are like particles moving through space, influenced by their own best-found position and the globally best positions.
- Differential Evolution (DE): New solutions are generated based on differences between existing solutions in the population.
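Example: the population, fitness, selection, crossover and mutation steps can be sketched as a small genetic algorithm. The `fitness` function is a hypothetical stand-in for "train a model and return its validation score"; the two hyperparameters, their ranges, and the settings (population of 20, 30 generations) are assumptions for illustration.

```python
import random

def fitness(solution):
    """Hypothetical validation score, best at log_lr = -3 and dropout_rate = 0.3."""
    log_lr, dropout_rate = solution
    return -((log_lr + 3.0) ** 2) - (dropout_rate - 0.3) ** 2

def evolve(pop_size=20, generations=30, mutation_scale=0.1, seed=0):
    rng = random.Random(seed)
    # Population of random solutions (sets of hyperparameters)
    population = [(rng.uniform(-6.0, 0.0), rng.uniform(0.0, 0.9))
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)  # fitness evaluation
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]        # survival of the fittest
        children = [ranked[0]]                   # keep the best solution unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = tuple(rng.choice(pair) for pair in zip(a, b))           # crossover
            child = tuple(g + rng.gauss(0.0, mutation_scale) for g in child)  # mutation
            children.append(child)
        population = children
    # A real search would also clip mutated values to their valid ranges.
    return max(population, key=fitness), history

best, history = evolve()
```

Carrying the best solution over unchanged guarantees that the best score never gets worse from one generation to the next.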
Exploding gradient
During training, neural networks update their weights using backpropagation, which calculates the error (the difference between the predicted and true value) and propagates it backward through the layers to compute gradients. Gradients tell us how much to adjust the weights. In deep neural networks, these gradients can get multiplied through many layers.
If the weights are initialized too large, or the network is very deep (including recurrent networks unrolled over long sequences), these multiplied gradients can become extremely large, leading to the exploding gradient problem. Saturating activation functions such as sigmoid, whose outputs flatten near 0 or 1, are more typically associated with the opposite problem, vanishing gradients.
Huge gradients result in massive updates to the network weights during training. This can lead to instability: the model may wildly overshoot the optimal solution, or the weights might become so large that they overflow to NaN (Not a Number), breaking training entirely.
Techniques such as gradient clipping and normalization are often used to mitigate the problem of exploding gradients.
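Example: a minimal sketch of gradient clipping by global norm, one common form of the technique. The function name `clip_by_global_norm` and the threshold of 5.0 are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the gradients so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)               # small gradients pass through unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]    # direction is kept, only the length shrinks

clipped = clip_by_global_norm([30.0, 40.0], max_norm=5.0)
```

Here the gradient has norm 50, so it is scaled down by a factor of ten while still pointing in the same direction.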
Exploration vs. Exploitation
A fundamental trade-off in decision-making and optimization, particularly in reinforcement learning and multi-armed bandit problems.
Exploration refers to the process of gathering information about the environment or exploring different options to discover potentially better solutions.
Exploitation, on the other hand, involves leveraging known information or exploiting current knowledge to maximize immediate rewards or benefits.
Balancing exploration and exploitation is essential for learning and decision-making in dynamic environments, where the goal is to achieve a balance between gathering new information and exploiting existing knowledge to optimize long-term performance. Machine learning models learn from data. Your dataset is often a mere snapshot of all possible scenarios. A model too focused on exploiting the knowledge in the current data may perform poorly on new, unseen data (overfitting). Exploration is needed to help it generalize better.
The exploration-exploitation dilemma is most pronounced in reinforcement learning (RL), where an agent learns through interacting with an environment and receiving rewards. Common strategies for balancing the two include:
- Epsilon-greedy (particularly in RL): Take a random, exploratory action with a small probability.
- Decaying Epsilon: Start with a lot of exploration and decrease it over time.
- Optimism in the face of uncertainty: Favor under-explored actions or parts of the parameter space, providing an incentive for the model to try new things.
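Example: epsilon-greedy with a decaying epsilon can be sketched on a multi-armed bandit. The reward model (Gaussian noise around each arm's true mean) and the schedule values (`eps_start`, `eps_end`, `decay`) are assumptions for illustration.

```python
import random

def run_bandit(true_means, steps=2000, eps_start=1.0, eps_end=0.05,
               decay=0.995, seed=0):
    """Play a bandit with epsilon-greedy action selection and decaying epsilon."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    epsilon = eps_start
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        epsilon = max(eps_end, epsilon * decay)                    # explore less over time
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 1.5])
```

Early on almost every action is random, which gives every arm a reasonable estimate; as epsilon decays the agent spends most of its pulls on the arm it believes is best.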
Feature detector (kernel, filter)
A feature detector, also known as a kernel or filter, is a small matrix or template used in convolutional neural networks (CNNs) to extract specific features or patterns from input data. Feature detectors are applied to input data using a convolution operation, where the filter is convolved with the input to produce feature maps. Different feature detectors (e.g., edge detectors, texture detectors, shape detectors) capture different aspects of the input data. In CNNs their values are learned during training, whereas in classical image processing they are defined manually.
Fold
In cross-validation, a fold refers to a distinct subset of data used for training and validation. The dataset is divided into multiple folds, typically of equal size, where each fold is used as a validation set exactly once while the remaining folds are used for training. The cross-validation process is repeated for each fold, ensuring that every data point is used for both training and validation.
For example, in k-fold cross-validation, the dataset is divided into k folds. The model is trained k times, with each fold used once as a validation set and the remaining k-1 folds used for training. The performance metrics are averaged over all k runs to provide an overall estimate of the model’s performance. Cross-validation helps in assessing the generalization performance of a model, detecting overfitting, and tuning hyperparameters.
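Example: a minimal sketch of how k-fold cross-validation splits sample indices. The function name `k_fold_indices` is hypothetical; a real workflow would usually shuffle the data first and train and score a model on each split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread any remainder over the first folds so that sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

splits = list(k_fold_indices(10, 3))
```

Each index appears in exactly one validation fold and in the training set of every other split.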
Fully connected layers
Fully connected layers, also known as dense layers, are layers in artificial neural networks where each neuron is connected to every neuron in the preceding layer. Networks built entirely from such layers are sometimes called fully connected neural networks (FCNNs). In a fully connected layer, each neuron receives input from all neurons in the previous layer and computes a weighted sum of these inputs, followed by an activation function to produce the output. Fully connected layers are commonly used in feedforward neural networks and deep learning architectures for tasks such as classification, regression, and feature learning.
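Example: a minimal sketch of the forward pass of one fully connected layer, with one row of weights and one bias per neuron. The weights, biases, and inputs are made-up values for illustration.

```python
def relu(z):
    return max(0.0, z)

def dense_forward(inputs, weights, biases, activation=None):
    """Compute the output of a fully connected layer for one input vector."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Every neuron sees every input: weighted sum plus bias
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

layer_output = dense_forward(
    inputs=[1.0, 2.0],
    weights=[[0.5, -1.0],   # weights of neuron 0
             [1.0, 1.0]],   # weights of neuron 1
    biases=[0.0, 0.5],
    activation=relu,
)
```

Neuron 0 computes 0.5 - 2.0 = -1.5, which ReLU clamps to 0, and neuron 1 computes 1.0 + 2.0 + 0.5 = 3.5.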