Build, Train and Tune Model Flashcards

Question

Gating

Answer 1

A process of controlling or modulating the flow of information within neural networks using gating mechanisms. Gating mechanisms selectively filter, amplify, or suppress information based on learned or predefined criteria, allowing networks to focus on relevant features or suppress irrelevant noise. Gating is commonly used in recurrent neural networks (RNNs) through mechanisms such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells to regulate the flow of information over sequential data.

Answer 2

It's a type of statistical noise where the probability distribution of the noise, meaning the values the noise can take on, follows a Gaussian distribution (also known as a normal distribution). A Gaussian distribution forms the classic bell-shaped curve, where values near the mean (average) are the most likely, and the probability decreases as values move further away. It is typically characterized by random fluctuations with a mean of zero and a constant variance, resulting in a symmetric distribution around the mean. Gaussian noise is often added to signals or data to simulate random variability or uncertainty, model measurement errors, or introduce randomness in stochastic processes. In machine learning, Gaussian noise is sometimes injected into input data or model parameters to regularize the learning process, prevent overfitting, or augment the training data.
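
As a minimal sketch of noise injection for data augmentation, the snippet below adds zero-mean Gaussian noise to a list of values using only the Python standard library. The function name `add_gaussian_noise` and its `std` and `seed` parameters are illustrative assumptions, not part of any particular framework.

```python
import random

def add_gaussian_noise(values, std=0.1, seed=None):
    """Return a copy of `values` with zero-mean Gaussian noise added to each entry."""
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    return [v + rng.gauss(0.0, std) for v in values]

# Hypothetical usage: perturb a small feature vector.
features = [1.0, 2.0, 3.0]
noisy_features = add_gaussian_noise(features, std=0.1, seed=42)
```

A larger `std` gives a stronger perturbation; a value of zero leaves the data unchanged.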

Answer 3

A measure of the impurity or randomness of a set of elements in a classification problem. It quantifies the probability of misclassifying an element randomly chosen from the set if it were labeled according to the class distribution in the set. A lower Gini index indicates higher purity and better separation of classes, while a higher Gini index indicates higher impurity and mixing of classes. The Gini index is commonly used as a criterion for splitting nodes in decision trees and evaluating the quality of splits in decision tree algorithms such as CART (Classification and Regression Trees).
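
The definition above can be sketched in a few lines, assuming the usual formula Gini = 1 - sum of squared class proportions. The helper names `gini_impurity` and `weighted_gini` are hypothetical; the second shows how a decision tree might score a candidate split by the size-weighted impurity of its two children.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Size-weighted impurity of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

pure = gini_impurity(["a", "a", "a", "a"])      # a single class: no impurity
mixed = gini_impurity(["a", "a", "b", "b"])     # an even two-class mix
split = weighted_gini(["a", "a"], ["b", "b"])   # a split that separates the classes
```

A split is preferred when its weighted impurity is lower than the impurity of the parent node.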

Answer 4

Technique used to deal with exploding gradients. The central idea is simple: if the gradient exceeds a certain threshold, you clip its magnitude to stay within a reasonable range. Here are the common methods:
Clipping by Value:
- You define a minimum and maximum threshold.
- If a gradient component is less than the minimum, clip it to the minimum value.
- If a gradient component is larger than the maximum, clip it to the maximum value.
Clipping by Norm:
- Calculate the norm of the gradient vector (e.g., L2 norm).
- If the norm exceeds a threshold, rescale the entire gradient vector so its norm is equal to the threshold. This preserves the direction of the gradient while limiting its magnitude.
The ideal threshold is problem-dependent, so some experimentation is usually needed. While it helps with exploding gradients, clipping doesn't address vanishing gradients. Other techniques (e.g., careful weight initialization, LSTMs) may be needed as well. Gradient clipping helps prevent giant updates that derail your learning process. It also makes the model less sensitive to the choice of learning rate, since you're capping how much change can occur in a single update.
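
The two clipping methods can be sketched on a plain list of gradient components. The function names `clip_by_value` and `clip_by_norm` are illustrative assumptions and are not the API of any specific library.

```python
import math

def clip_by_value(grad, min_value, max_value):
    """Clamp each gradient component into [min_value, max_value]."""
    return [max(min_value, min(max_value, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)  # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grad]

grad = [3.0, -4.0]                          # L2 norm is 5
by_value = clip_by_value(grad, -1.0, 1.0)   # each component clamped independently
by_norm = clip_by_norm(grad, 1.0)           # direction kept, length reduced to 1
```

Note the difference: clipping by value can change the direction of the gradient, while clipping by norm only shortens it.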

Answer 5

An iterative optimization algorithm used to minimize the loss function and find the optimal parameters (weights and biases) of a machine learning model. It works by iteratively adjusting the model parameters in the opposite direction of the gradient of the loss function with respect to the parameters. By stepping against the gradient, the algorithm seeks to descend along the steepest path towards the minimum of the loss function. During gradient descent, neural network parameters receive an update proportional to the partial derivative of the cost function with respect to the current parameter in each iteration of training. Gradient descent comes in different variants, including batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with its own trade-offs in terms of convergence speed, memory usage, and computational efficiency. The magnitude of the gradient for a specific weight or bias signifies how sensitive the error is to changes in that parameter. A larger gradient indicates that a change in that weight or bias will have a more significant impact on the error. The sign of the gradient tells us whether to increase or decrease the parameter. A positive gradient means that increasing the parameter would increase the error, so the parameter should be decreased, while a negative gradient means that increasing the parameter will reduce the error.
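
A minimal sketch of the update rule, assuming a one-parameter toy loss f(w) = (w - 3)^2 whose gradient is 2(w - 3). The function names and the chosen learning rate are illustrative only.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2; it is negative when w < 3."""
    return 2.0 * (w - 3.0)

w_final = gradient_descent(toy_grad, w0=0.0)  # approaches the minimizer w = 3
```

Starting at w = 0 the gradient is negative, so subtracting it increases w toward the minimum, which matches the rule of moving in the direction opposite to the gradient.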

Answer 6

Process of optimizing the hyperparameters of machine learning models using gradient-based optimization algorithms. Instead of manually tuning hyperparameters or using grid search techniques, gradient-based methods leverage the gradients of a chosen performance metric (e.g., validation loss) with respect to the hyperparameters. By iteratively updating the hyperparameters in the direction that minimizes the loss, these methods efficiently search the hyperparameter space and find optimal or near-optimal configurations. This requires the metric to be differentiable with respect to the hyperparameters, so it applies mainly to continuous ones. Examples include gradient-based meta-learning approaches, which learn to adapt hyperparameters during training. Bayesian optimization is often mentioned alongside these methods but is a distinct approach: it models the performance metric with a probabilistic surrogate function and uses an acquisition function, rather than gradients of the metric itself, to guide the search.

Answer 7

Grid Search Cross-Validation (CV) is a technique used to tune the hyperparameters of a machine learning model by exhaustively searching through a specified grid of hyperparameter values and evaluating each combination using cross-validation to determine the optimal set of hyperparameters.

Answer 8

Histogram of Oriented Gradients (HOG) is a feature extraction technique used in computer vision and image processing to represent the local texture and shape information of an image. HOG computes histograms of gradient orientations within localized regions of the image and concatenates these histograms to form a feature vector that describes the overall structure of the image. HOG features are commonly used in object detection, pedestrian detection, and other tasks where capturing shape and texture information is important.

Answer 9

Also known as validation sets or validation data, holdout sets are subsets of the dataset used to evaluate the performance of a machine learning model during training. Holdout sets are distinct from the training set and are not used for model parameter estimation. Instead, they are used to assess the generalization performance of the model on unseen data and to tune hyperparameters such as learning rate, regularization strength, and model architecture. Holdout sets are typically held out from the training process and only used intermittently to monitor the model's performance and prevent overfitting. You might use a holdout set iteratively throughout development, adjusting your model based on its performance. The test set is meant to be used only once. If you use a holdout set repeatedly to tune your model, you risk it subtly influencing your choices and biasing your estimate. The test set, held strictly separate, avoids this. The holdout set is used primarily during the model development process. The test set is used for a final, rigorous assessment reserved until the very end of the development process. This is meant to give an unbiased estimate of the final model's performance to help you decide if it's ready for deployment.

Answer 10

In the context of deep learning frameworks such as PyTorch and TensorFlow, hooks are callback functions or mechanisms used to intercept and observe internal states or operations of neural network modules during the forward and backward passes. Hooks allow users to inspect and manipulate intermediate activations, gradients, and other internal variables of the network for debugging, visualization, and research purposes. Hooks are commonly used for feature visualization, gradient-based optimization, and model interpretation in deep learning workflows.

Answer 11

The C hyperparameter in SVMs acts as a regularization parameter. It navigates the trade-off between:
- Large C: Enforces a stricter decision boundary, aiming to classify training examples correctly even if it leads to a more complex model (risks overfitting).
- Small C: Allows for a wider margin around the decision boundary, accepting some misclassifications on the training data for the sake of better generalizing to unseen data (risks underfitting).
When training an SVM, the goal is to find a hyperplane that separates the classes in your data while maximizing the margin (the distance between the hyperplane and the closest data points for each class). The C parameter controls the penalty applied for having data points within the margin or on the wrong side of the hyperplane.

Answer 12

Tree-Specific Parameters:
- max_depth: Maximum depth of each tree.
- min_child_weight: Minimum sum of instance weight needed in a child node.
- subsample: Subsample ratio of the training instances.
- colsample_bytree: Subsample ratio of columns when constructing each tree.
- colsample_bylevel: Subsample ratio of columns for each level.
- colsample_bynode: Subsample ratio of columns for each split.
- max_delta_step: Maximum delta step allowed for each tree's weight estimation.
- gamma: Minimum loss reduction required to make a further partition on a leaf node.
- lambda: L2 regularization term on weights.
- alpha: L1 regularization term on weights.
- scale_pos_weight: Control the balance of positive and negative weights.
Learning Task Parameters:
- objective: The learning objective or loss function.
- eval_metric: Evaluation metric for validation data.
- num_class: Number of classes in a multi-class classification.
Learning Control Parameters:
- eta or learning_rate: Step size shrinkage used to prevent overfitting.
- n_estimators or num_boost_round: Number of boosting rounds or trees to build.
- early_stopping_rounds: Early stopping to prevent overfitting based on a validation dataset.
- verbose: Verbosity level.
- silent: Whether to print messages during training.
Additional Parameters (Specific to Certain Implementations):
- Parameters specific to XGBoost: e.g., tree_method, booster, gpu_id, max_bin.
- Parameters specific to LightGBM: e.g., boosting_type, num_leaves, max_bin, device.
- Parameters specific to CatBoost: e.g., depth, border_count, l2_leaf_reg.

Answer 13

A mathematical function used to compute the similarity or dot product between pairs of data points in a higher-dimensional space without explicitly mapping them to that space. Kernel functions are commonly used in kernel methods, such as support vector machines (SVMs) and kernel ridge regression, to transform input data into a higher-dimensional feature space where linear separation or regression is easier to achieve. Popular kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels, each with its own characteristics and suitability for different types of data.

Answer 14

A technique used in machine learning to implicitly map input data into a higher-dimensional feature space using kernel functions without explicitly computing the transformed feature vectors. By applying the kernel trick, kernel methods such as support vector machines (SVMs) and kernel ridge regression can operate in the original input space while benefiting from the advantages of working in a higher-dimensional feature space, such as increased expressiveness and improved separability of classes or patterns. The kernel trick allows these algorithms to efficiently handle non-linear relationships in the data and perform complex pattern recognition tasks without explicitly computing the feature vectors.
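
As a small worked check of the idea, the sketch below uses the degree-2 polynomial kernel k(x, y) = (x · y)^2 on 2-dimensional inputs. Its explicit feature map is phi(x) = (x1^2, sqrt(2)·x1·x2, x2^2), so the kernel value computed in the input space should match the dot product computed in the 3-dimensional feature space. The function names are hypothetical.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel, computed directly in the input space."""
    return dot(x, y) ** 2

def explicit_features(x):
    """Explicit feature map for poly2_kernel on 2-dimensional inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

x = [1.0, 2.0]
y = [3.0, 4.0]
k_implicit = poly2_kernel(x, y)                                # no feature map needed
k_explicit = dot(explicit_features(x), explicit_features(y))   # same value, more work
```

For higher degrees or kernels such as the RBF, the explicit feature space becomes very large or infinite-dimensional, which is why computing the kernel directly is the practical route.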

Answer 15

While kernels in both linear algebra and ML involve inner products (dot products), ML kernels aren't about finding the null space as in linear algebra. ML kernels serve as generalized similarity measures, with high values indicating strong relatedness, often within a high-dimensional space. This enables algorithms to discover nonlinear patterns in the data, which is essential for many real-world tasks. In other words, at their heart, machine learning kernels are functions that compute the degree of similarity between two data points, and they often unlock the power of working in higher-dimensional spaces without the computational cost of explicitly transforming the data. This is known as the "kernel trick."
Key ideas:
* A kernel's output signifies the degree of similarity. This could be based on a simple linear relationship or a complex, nonlinear pattern.
* Many kernels implicitly correspond to similarity measures in high-dimensional or even infinite-dimensional spaces. The beauty is, these calculations happen without directly transforming the data, saving computational resources.
* Many kernels rely on dot products. This connection stems from the dot product's role in measuring vector alignment.
Common kernels:
* Linear Kernel: The simple dot product.
* Polynomial Kernel: Dot product raised to a power, implicitly capturing feature combinations.
* Gaussian Kernel (RBF): Computes similarity based on scaled distance, interpretable as a transformation and then a dot product.
Beyond Dot Products:
* Not All Similarity Measures Are Kernels: Kernels must be positive semi-definite (every kernel matrix built from a set of data points must be positive semi-definite) to be valid and useful in ML algorithms.
* Custom Kernels: Researchers design specialized kernels to compare complex structures like text, graphs, and trees.
Understanding the Kernel Trick:
* The power of the kernel trick lies in calculating a complex transformation's results without explicitly performing it. This can be far more computationally efficient than building the feature vectors.
* This implicit mapping often helps find nonlinear patterns and decision boundaries, making complex problems tractable.

Answer 16

A hyperparameter that determines the step size or rate at which the parameters (weights and biases) of a machine learning model are updated during training using optimization algorithms such as gradient descent. A higher learning rate results in faster convergence but may risk overshooting the optimal solution or causing instability, while a lower learning rate may lead to slower convergence or getting stuck in local minima. The learning rate is a critical hyperparameter that requires careful tuning to ensure optimal performance and convergence speed of the training process.

Answer 17

Measures the error of a model's prediction on a single data point or example. However, it is often used interchangeably with "cost function", which measures the overall error of the model across the data set.

Answer 18

The terms "loss function" and "cost function" are often used interchangeably to refer to the function that measures the error or discrepancy between the model's predictions and the true labels in the training data. However, in some contexts, the term "loss function" is used to refer to the function applied to a single training example, while the term "cost function" refers to the aggregate or average loss over the entire training dataset.

Answer 19

In data modeling, a many-to-many relationship is when multiple instances of one entity can be related to multiple instances of another entity, and vice-versa.

Answer 20

A pooling operation used in convolutional neural networks (CNNs) to downsample feature maps and reduce spatial dimensions while retaining important features. Max pooling divides the input feature map into non-overlapping regions (typically squares) and outputs the maximum value within each region, discarding the rest. This process reduces the spatial size of the feature maps, making them more computationally efficient to process and less sensitive to small spatial variations. Max pooling is commonly used after convolutional layers in CNN architectures to progressively reduce the spatial resolution of feature maps while preserving important features.
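
A minimal sketch of non-overlapping max pooling on a feature map stored as a list of rows. The function name `max_pool_2d` and the 4x4 example values are made up for illustration; real frameworks also handle batches, channels, padding, and separate stride settings.

```python
def max_pool_2d(feature_map, size=2):
    """Non-overlapping max pooling: window size and stride are both `size`."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            # Collect the values inside the current size x size window.
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2d(feature_map)  # [[6, 2], [2, 9]]
```

The 4x4 input becomes a 2x2 output, and each output value is the largest activation in its window.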

Answer 21

Mini-batch refers to the practice of dividing your large dataset into smaller, fixed-size groups of samples called mini-batches. Instead of updating the model's parameters based on the entire dataset at once (batch gradient descent) or using a single example at a time (stochastic gradient descent), the model is updated after processing each mini-batch. This approach strikes a balance between the stability of batch gradient descent and the speed of stochastic gradient descent, leading to faster convergence and better generalization for most deep learning problems.

Answer 22

A technique used in optimization algorithms, particularly in gradient descent variants, to accelerate convergence and improve optimization performance. In momentum-based optimization, the update to the model parameters (weights and biases) is not only influenced by the current gradient but also by a momentum term that accumulates previous gradients. It incorporates a fraction of the previous update into the current update, creating a kind of 'rolling snowball' effect. By incorporating momentum, the optimization algorithm gains inertia and smooths out oscillations in the gradient descent trajectory, allowing it to overcome local minima and escape saddle points more effectively. Momentum helps accelerate convergence, improve stability, and navigate complex optimization landscapes in machine learning models.
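
One common formulation of momentum keeps a velocity that accumulates past gradients: v <- beta * v + gradient, then w <- w - learning_rate * v. The sketch below applies it to the toy loss (w - 3)^2; the function name, the value beta = 0.9, and the learning rate are assumptions chosen for illustration, and other formulations scale the terms slightly differently.

```python
def momentum_step(w, velocity, grad, learning_rate=0.1, beta=0.9):
    """One update: the velocity accumulates gradients, then moves the parameter."""
    velocity = beta * velocity + grad
    w = w - learning_rate * velocity
    return w, velocity

# Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = 0.0, 0.0
first_w, first_v = momentum_step(w, v, 2.0 * (w - 3.0))
for _ in range(300):
    w, v = momentum_step(w, v, 2.0 * (w - 3.0))
```

With `beta = 0` this reduces to plain gradient descent; values closer to 1 carry more of the previous updates forward.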

Answer 23

Also known as joint loss or composite loss, is a loss function used in multitask learning to optimize multiple learning objectives simultaneously. In multitask learning, a single model is trained to perform multiple related tasks simultaneously, leveraging shared representations and transfer learning to improve performance on each task. The multitask loss function aggregates the individual losses from each task into a single composite loss, which is minimized during training using gradient-based optimization algorithms. Multitask loss encourages the model to learn task-specific features while sharing knowledge and information across tasks, leading to more efficient learning and better generalization performance.

Answer 24

Achieved by activation functions that take the output of a linear function and perform a non-linear operation on it. Neural networks by themselves wouldn't be very powerful without the concept of non-linearity. Each layer in a neural network performs a linear combination of its inputs, which is like drawing a straight line through the data. But the real world is full of complex relationships, not straight lines. Here's where non-linearity comes in: it's achieved through activation functions applied after each layer's linear operation. These functions introduce bends and curves, allowing the network to model complex patterns. Without them, a stack of linear layers would collapse into a single linear transformation. Imagine stacking multiple curved functions together – like building a curvy road – the network can learn to represent very intricate relationships between the input data and the desired output, making it suitable for tasks like image recognition or speech translation.

Answer 25

Optimizing means finding the minimum or maximum of a function. In machine learning, it is the process of adjusting the parameters (weights and biases) of a model to minimize a predefined objective function or loss function. The goal of optimization is to find the set of parameters that best fits the training data and generalizes well to unseen data. Optimization algorithms, such as gradient descent and its variants, iteratively update the model parameters based on the gradients of the loss function with respect to the parameters. Optimization techniques play a critical role in training machine learning models, ensuring convergence, stability, and efficiency in the learning process.

Answer 26

Parameterizing a model is akin to setting up the skeletal structure, which determines the types and potential number of parameters the model will learn. It refers to the process of defining and setting the parameters (also known as weights and biases) that govern the behavior of the model. In the context of machine learning, parameters are the variables that the model learns from the training data to make predictions or perform a specific task. The goal of training is then to find the optimal values for these parameters that minimize the difference between the model's predictions and the actual outcomes. The model's structure and parameters determine the kinds of patterns it could learn, much like a recipe defines a range of possible cakes. The actual power of the model comes from finding the best possible parameter settings during training.

Answer 27

Pooling works similarly to convolution, but instead of applying a trainable filter it applies a fixed operator such as max or average. Pooling has no trainable parameters, only hyperparameters: the stride, the type (max or average), and the size of the filter. By reducing the amount of information passed on, it contributes to speeding up network training.

Answer 28

A dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving the most important information. PCA identifies the principal components, which are the orthogonal axes that capture the maximum variance in the data. By projecting the data onto these principal components, PCA reduces the dimensionality of the dataset while minimizing information loss. PCA is widely used for data visualization, noise reduction, and feature extraction in various machine learning and data analysis tasks.

Answer 29

Propagation generally refers to how information or changes flow through a complex, interconnected system. Essentially, it means a chain of calculations in which each one depends on the previous one. In machine learning, it occurs in two key areas:
1. Forward Propagation (Input to Output): Input data is fed into the first layer of a neural network. Calculations propagate forward, layer by layer, with each layer applying weights, biases, and activation functions. Finally, the output layer produces the prediction or classification. Think of it as a chain reaction, where the output of one stage triggers calculations in the next.
2. Backward Propagation (Errors for Training): Backpropagation (the heart of training neural networks) is where "propagation" becomes truly crucial. The prediction error (difference between the desired output and the network's actual output) is calculated. This error signal is propagated backward through the network, layer by layer. Using calculus (chain rule), the contribution of each weight and bias to the error is determined. Weights and biases are adjusted slightly in a direction that minimizes the error (gradient descent).

Answer 30

A technique used in machine learning and neural networks to reduce the size and complexity of the model by removing unnecessary or redundant parameters, connections, or nodes. Pruning helps improve model efficiency, reduce overfitting, and enhance generalization performance by simplifying the model structure and removing irrelevant features. Pruning can be applied during training or after training by setting small weights or connections to zero based on certain criteria, such as magnitude, importance, or contribution to the overall model performance.

Answer 31

In the vast space of possible hyperparameter combinations, random search is often surprisingly effective compared to exhaustive searches. Unlike grid search, it doesn't assume that hyperparameters impact performance in a smooth or monotonic way. This can be helpful when the relationship between hyperparameters and performance is complex. By sampling randomly, it can often find good-enough hyperparameter values much faster than trying every possible combination in a grid. To do it you specify a distribution (ranges) for each hyperparameter to search over (e.g., uniform distribution between a minimum and a maximum value). The algorithm randomly samples combinations of hyperparameters from the defined distributions. When to Use It: Good at the beginning when you have little knowledge about which hyperparameters are important. Useful when you have many hyperparameters to tune. Useful when you're limited by time or computational resources.
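
The procedure can be sketched with the standard library alone. Here each hyperparameter gets a uniform range, and `toy_objective` is a made-up stand-in for a real validation loss; the names `random_search`, `space`, `lr`, and `reg` are all illustrative assumptions.

```python
import random

def random_search(objective, space, n_trials=50, seed=0):
    """Sample each hyperparameter uniformly from its range; keep the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Hypothetical validation loss with its minimum at lr=0.1, reg=0.01."""
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 0.01) ** 2

space = {"lr": (0.0, 1.0), "reg": (0.0, 0.1)}
best_params, best_score = random_search(toy_objective, space)
```

In practice, scale-sensitive hyperparameters such as the learning rate are often sampled on a logarithmic scale rather than uniformly.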

Answer 32

The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is a popular kernel function used in kernel methods, particularly in support vector machines (SVMs) and kernelized regression models. The RBF kernel computes the similarity between two data points as a Gaussian function of the distance between them, which implicitly corresponds to a dot product in a higher-dimensional space. It is defined as: K(x, x′) = exp(−||x − x′||^2 / (2σ^2)), where x and x′ are data points, ||x − x′||^2 denotes the squared Euclidean distance, and σ is a hyperparameter that controls the spread of the kernel. The RBF kernel is versatile and effective for capturing non-linear relationships in the data.
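
A short sketch of the kernel in its common form K(x, x′) = exp(−||x − x′||^2 / (2σ^2)), written with the standard library only. The function name `rbf_kernel` and the example points are illustrative.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points give the maximum, 1.0
near = rbf_kernel([0.0, 0.0], [1.0, 0.0])   # squared distance 1
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # squared distance 25, close to 0
```

A larger σ makes the kernel decay more slowly with distance, so distant points still count as similar; a smaller σ makes the similarity more local.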

Answer 33

The Rectified Linear Unit (ReLU) function is a non-linear activation function commonly used in neural networks to introduce non-linearity and enable the network to learn complex patterns and relationships in the data. The ReLU function is defined as: f(x) = max(0, x). This means it outputs the input x if it is positive and zero otherwise. ReLU activation is computationally efficient, easy to implement, and helps mitigate the vanishing gradient problem during training. It's widely used in deep learning architectures like convolutional neural networks (CNNs) and feedforward neural networks.

Answer 34

Resampling methods in machine learning are techniques used to modify or create new training data sets by randomly sampling from the original data set. These methods are particularly useful for tasks such as model evaluation, model selection, and dealing with imbalanced data sets. They can improve the performance of models when certain classes are severely underrepresented. They can also reduce the variance of model evaluation when you have limited data. Resampling shouldn't be the first solution for imbalance; sometimes collecting more data or making algorithmic adjustments is better. Especially with over-sampling, there's a risk that the model will memorize the specific oversampled examples.
Over-Sampling: Used for imbalanced datasets where some classes are underrepresented. Techniques include:
- Random Over-Sampling: Replicating examples from the minority class (can lead to overfitting).
- SMOTE: Generating synthetic new examples in the minority class by interpolating between existing minority examples.
- ADASYN: Similar to SMOTE, but focuses on creating more difficult minority examples near class boundaries.
Under-Sampling: Used when you have an abundance of data in certain classes. Techniques include:
- Random Under-Sampling: Randomly removing examples from the majority class (risks discarding valuable information).
- Cluster Centroids: Replacing groups of majority class samples with the cluster centroid.
- Tomek Links: Identifying and removing borderline majority examples that might be mislabeled or noisy.
Bootstrapping: Used to estimate the statistical properties of a model, such as its accuracy or confidence intervals. It involves randomly sampling with replacement from the original dataset to create multiple bootstrap samples, each of the same size as the original dataset. The model is trained on each bootstrap sample, and the aggregated results are used to estimate the model's performance or statistical properties. Bootstrapping is particularly useful when the dataset is limited or when estimating uncertainty in model predictions.
Cross-Validation: Used to evaluate the performance of a machine learning model. The dataset is divided into multiple subsets, or folds. In each round, one fold is reserved for testing and the remaining folds are used for training. The model is trained on the training folds and evaluated on the test fold. This process is repeated multiple times, with each fold used as the test set exactly once. Common types of cross-validation include k-fold cross-validation, stratified k-fold cross-validation, leave-one-out cross-validation, and repeated cross-validation.
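
The k-fold procedure can be sketched as an index-splitting helper. This version assumes no shuffling and no stratification, and the name `k_fold_indices` is hypothetical; library implementations usually offer both options.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k folds; each fold is the test set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, 3)  # test folds of sizes 4, 3 and 3
```

Each `(train_idx, test_idx)` pair would be used to fit the model on the training indices and score it on the test indices, and the k scores are then averaged.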

Answer 35

A type of neural network architecture in which the layers are arranged sequentially, one after the other, with each layer feeding its output as the input to the next layer. Sequential models are simple and easy to understand, making them suitable for a wide range of machine learning tasks, including classification, regression, and sequence prediction. Sequential models are one of several alternatives, alongside non-sequential models, tree-based models, and others.

Answer 36

The sigmoid function, also known as the logistic function, is a non-linear activation function commonly used in neural networks to introduce non-linearity and squash the output of a neuron into the range (0, 1). The sigmoid function is defined as: f(x) = 1 / (1 + e^(-x)), where e is the base of the natural logarithm. The sigmoid function produces a smooth S-shaped curve, suitable for binary classification and outputting probabilities. However, it's prone to saturation and vanishing gradients for extreme input values, potentially slowing learning in deep neural networks.
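
The saturation effect can be seen numerically. The sketch below computes the sigmoid and its derivative, which equals f(x) * (1 - f(x)); the two-branch form is a common way to avoid overflow for large negative inputs, and the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)    # the largest possible slope, 0.25
grad_at_ten = sigmoid_grad(10.0)    # nearly flat: the unit is saturated
```

Because the derivative never exceeds 0.25 and shrinks quickly away from zero, multiplying many such factors during backpropagation can make gradients vanish.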

Answer 37

In deep neural networks, a skip connection is a direct link that bypasses one or more layers, allowing information to "jump" ahead. Instead of data only flowing sequentially through layers, skip connections introduce alternative paths, creating a multi-layered, less linear structure. When networks get very deep, gradients used for updating weights during backpropagation can vanish or explode. Skip connections ease the flow of gradients back through the layers, helping with training very deep models. Skip connections allow earlier layers' information to directly reach later layers. This can help preserve important features that might otherwise get diluted as data passes through many transformations. Skip connections create networks that resemble ensembles of shallower models. This can improve performance and reduce overfitting. One of the most famous examples of using skip connections is ResNets. In ResNets, a block of layers' output gets added to the input before entering the next block, creating these skip paths.

Answer 38

A technique used in signal processing, computer vision, and time series analysis to process data in a sequential manner by moving a fixed-size window or kernel over the input data one step at a time. The sliding window approach allows for local feature extraction, segmentation, or analysis of data streams by capturing temporal or spatial patterns within the window. Sliding windows are commonly used in tasks such as object detection, edge detection, motion estimation, and feature extraction from time series data. They offer flexibility, adaptability, and efficiency in processing large datasets and streaming data in real-time. (is it the same thing as sliding a kernel?)

Answer 39

The softmax function is a generalization of the sigmoid function to multidimensional outputs. It is a mathematical function that converts a vector of raw scores or logits into a probability distribution over multiple classes. It is commonly used as the output activation function in multi-class classification problems, where the goal is to predict the probability of each class given a set of input features. The softmax function computes the probability of each class as the exponential of the input score divided by the sum of the exponentials of all scores, ensuring that the output probabilities sum to one. Softmax activation produces a smooth probability distribution that can be interpreted as the model's confidence in each class prediction.
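
A minimal sketch of the computation described above. Subtracting the largest score before exponentiating is a standard stability trick that does not change the result; the function name and example scores are illustrative.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])        # the largest score gets the largest probability
stable = softmax([1000.0, 1000.0])      # equal scores split the probability evenly
```

Without the shift, `math.exp(1000.0)` would raise an overflow error, which is why the shifted form is the one usually implemented.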

Answer 40

The state is a dynamic representation of the sequence the RNN has seen so far. Think of it as the current output of the memory cell, reflecting what it "remembers" from the past inputs. The state changes with every step in the sequence. The state is calculated using the current input AND the weights of the RNN. Analogy: Recipe (Weights): The instructions for how to bake a cake. Bowl of Batter (State): What's currently in the bowl after following some of the recipe steps. The batter changes as you add ingredients (inputs), but the recipe itself remains the same. In Recurrent Neural Networks (RNNs), the "state" (often called the "hidden state") refers to a piece of information that the network carries forward through time as it processes a sequence. Think of it as the RNN's memory. At each step, the RNN takes the current input and the previous state, combines them, and produces an output and an updated state. Representation: The state is a vector of numbers, which can be thought of as a learned representation of the sequence up to that point.
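
The recipe-and-batter picture can be sketched with a single-unit recurrent cell using one common update, h_t = tanh(w_x * x_t + w_h * h_{t-1} + b). The weights stay fixed while the state changes at every step. The scalar weights and their values are made-up assumptions; real RNNs use weight matrices and vector states.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """New state from the current input and the previous state (fixed weights)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence, returning the state after each step."""
    h = 0.0  # initial state: nothing seen yet
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, 0.0])  # one nonzero input followed by zeros
```

The states at the second and third steps remain nonzero even though the inputs there are zero, which is the "memory" of the first input fading over time.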

Answer 41

Stochastic Gradient Descent (SGD) is a variant of the gradient descent optimization algorithm used to train machine learning models by updating the model parameters (weights and biases) iteratively based on the gradients of the loss function computed on small random subsets of the training data. Unlike batch gradient descent, which computes the gradients using the entire training dataset, SGD approximates them from a single random example or, more commonly in practice, a mini-batch. Each update is therefore cheaper and needs less memory, at the cost of a noisier gradient estimate. SGD is widely used in deep learning and large-scale machine learning tasks due to its computational efficiency and ability to handle large datasets.
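
As a rough illustration, here is a mini-batch SGD loop that fits a line to four points. The function name, learning rate, batch size, and toy data are all assumptions chosen for the sketch:

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=2, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i] / len(batch)
                grad_b += 2 * err / len(batch)
            # Update from the mini-batch gradient only, not the full dataset.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)  # should end up close to 2 and 1
```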

Answer 42

In convolutional neural networks (CNNs), the stride is the step size with which the convolutional kernel moves across the input during the convolution operation. The stride determines the amount of overlap between adjacent receptive fields and affects the spatial dimensions of the output feature maps. A stride of one (the usual default) means that the kernel moves one pixel at a time; with suitable ("same") padding the output feature maps keep the spatial dimensions of the input, and without padding they shrink by the kernel size minus one. Larger strides reduce the spatial dimensions of the output feature maps, leading to spatial downsampling. A standard convolution cannot use a stride smaller than one; spatial upsampling is instead done with transposed (fractionally strided) convolutions or interpolation. Stride plays a crucial role in controlling the spatial resolution and how quickly the receptive field grows in CNN architectures.
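
The effect of stride on output size follows the standard formula floor((n + 2p - k) / s) + 1 along each spatial dimension. A small sketch, where the function name `conv_output_size` is illustrative:

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial dimension of a standard convolution."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(224, kernel_size=3, stride=1, padding=0))  # 222
print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224
print(conv_output_size(224, kernel_size=3, stride=2, padding=1))  # 112
```

Stride 1 only preserves the input size when padding compensates for the kernel; stride 2 roughly halves it.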

Answer 43

The desired output or ground truth associated with each input data point in the training dataset. The target represents the correct label, class, or value that the machine learning model aims to predict or approximate during training. In classification tasks, the target is typically a categorical label or class label indicating the correct category or class membership of the input data. In regression tasks, the target is a continuous numerical value representing the true output or target variable associated with the input data. The goal of supervised learning is to train the model to predict the target accurately based on the input features and minimize the discrepancy between the predicted output and the true target value.

Answer 44

During training, a vocabulary is built. Even very large datasets can't cover every possible word, especially with misspellings, new terms, slang, and proper nouns. If a model encountered an unseen word in the wild, it would break down without a way to handle it. That is why we need an additional <UNK> token. It's a placeholder used to represent words that were not encountered during the training phase of a language model. These words are considered out-of-vocabulary (OOV). Generally, there is one <UNK> token in the vocabulary, and all unknown words are mapped to it. It is therefore not exactly an "unknown word" token: its vector holds averaged, generalized features of all the words that were unknown. Think of unknown words as being "collapsed" into the <UNK> token. It doesn't store individual unknown words, but its representation is shaped by the statistical patterns of how diverse unknown words are used in the contexts where they appear. Often, words below a certain frequency threshold are also replaced with the <UNK> token during training.
It is useful because, even though the model will never output a word that was unknown, it can still properly learn the features of the other words. If those words have features that describe how they relate to the unknown words, the <UNK> token helps preserve those features. However, it is not perfect. Multiple unknown words with contradictory sentiment in the same sentence confuse the model: the more different the unknown words are, the less effective a single token becomes. In general, the <UNK> token works best when there are few unknown words, or when they are at least similar to each other. Otherwise the vector is unspecific, captures all types of features, and points in all sorts of directions. This is why techniques like character-level modeling or subword tokenization are helpful, especially when you expect diverse unknown words. Here are some more developed techniques that address the limitations of the basic <UNK> token approach:
1. Character-Level Models
* No Unknown Words: Instead of operating on word-level tokens, these models break down text into individual characters or sequences of characters.
* Robustness: Any word can be represented, eliminating the need for an <UNK> token altogether.
* Trade-offs: Computationally more expensive, and may need more data to learn meaningful character-level patterns.
2. Subword Tokenization
* Smaller Units: Words are split into subword units, such as common prefixes, suffixes, and roots (e.g., "undesirable" might become "un-," "desir-," "-able").
* Reduced Unknowns: Significantly decreases the vocabulary size needed, leading to fewer instances of unknown words.
* Flexibility: Techniques like Byte Pair Encoding (BPE) or SentencePiece learn the subword vocabulary directly from your data.
3. Leveraging Context Dynamically
* Attention-Based Models (e.g., Transformers): These models excel at dynamically weighting the importance of different words in a sentence. This makes them more robust to unknown words, as they can focus on the known words that provide a stronger signal.
* Pretrained Language Models: Models like BERT or GPT-3 are trained on massive text datasets, giving them broad vocabulary coverage and a better ability to use the context around unknown words.
4. Hybrid Approaches
* Combining Techniques: Using character-level or subword models for unknown words while maintaining a word-level vocabulary for frequent terms can be a powerful hybrid approach.
* Data Augmentation: Strategically replacing known words with <UNK> during training can improve the model's ability to handle unseen words.
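
A minimal sketch of the basic mechanism described in this card: build a word-level vocabulary with a frequency threshold and map everything else to one placeholder. The placeholder spelling `<UNK>`, the helper names, and the toy sentences are assumptions for illustration:

```python
from collections import Counter

UNK = "<UNK>"  # assumed spelling of the unknown-word placeholder


def build_vocab(sentences, min_count=2):
    """Keep words seen at least min_count times; index 0 is reserved for UNK."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {UNK: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map each word to its id; unseen or rare words collapse into UNK."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


train = ["the cat sat", "the dog sat", "the cat ran"]
vocab = build_vocab(train, min_count=2)
print(vocab)                           # {'<UNK>': 0, 'the': 1, 'cat': 2, 'sat': 3}
print(encode("the zebra sat", vocab))  # [1, 0, 3]
```

Note that "dog" and "ran" appear in training but fall below the threshold, so they share the same id as the never-seen "zebra".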

Answer 45

A problem encountered during the training of deep neural networks, where the gradients of the loss function with respect to the model parameters become extremely small as they are backpropagated through many layers of the network. In deep networks, as gradients are propagated backward from the output layer to the input layer during backpropagation, they can diminish exponentially with each layer due to the repeated application of activation functions with small derivatives, such as sigmoid or hyperbolic tangent functions. As a result, the updates to the parameters in the early layers become negligible, leading to slow convergence and difficulty in learning meaningful representations from the data. Vanishing gradient can impede the training of deep networks and affect their ability to capture complex patterns in the data.
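
A back-of-the-envelope sketch of why the gradient shrinks. It assumes a chain of sigmoid layers with all weights equal to 1 and every pre-activation at 0, where the sigmoid derivative takes its maximum value of 0.25; real networks differ, so treat this only as an illustration of the exponential decay:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(depth, pre_activation=0.0):
    """Product of sigmoid derivatives across `depth` layers (weights assumed to be 1)."""
    return sigmoid_derivative(pre_activation) ** depth


scales = {depth: gradient_scale(depth) for depth in (1, 5, 10, 20)}
for depth, scale in scales.items():
    print(depth, scale)  # shrinks by a factor of 4 per extra layer, even in the best case
```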

Answer 46

In the context of Convolutional Neural Networks (CNNs), "volume" typically refers to the three-dimensional structure of input data, intermediate representations (feature maps), and learnable parameters (filters or kernels) within the network architecture. In CNNs, we don't deal with flat images but with 3-dimensional blocks of data: height, width, and colour channels (e.g., 224 pixels x 224 pixels x 3). As data flows through the convolutional and pooling layers of a CNN, the dimensions of the representations change, but they keep this 3D volume-like structure. Less commonly, "volume" can also refer to the number of images processed simultaneously during training or inference (the batch size), or to the rate at which images are processed per unit of time, such as images per second (IPS) or frames per second (FPS), especially in real-time applications like video processing. Additionally, "volume" can represent the amount of data contained within a single image, particularly when dealing with high-dimensional data such as medical images or satellite imagery.

Answer 47

1. Sigmoid: Output range: (0, 1). Best usage: Output layer for binary classification tasks. Less common in hidden layers due to the vanishing gradient problem.
2. Hyperbolic Tangent (tanh): Output range: (-1, 1). Best usage: Hidden layers in neural networks for general-purpose tasks.
3. Rectified Linear Unit (ReLU): Output range: [0, +∞). Best usage: Hidden layers in deep neural networks. Effective in alleviating the vanishing gradient problem.
4. Leaky ReLU: Output range: (-∞, +∞). Best usage: Alternative to ReLU, providing a small, non-zero gradient for negative inputs, helpful in preventing the dying ReLU problem.
5. Parametric ReLU (PReLU): Output range: (-∞, +∞). Best usage: Similar to Leaky ReLU but with the negative slope α learned during training.
6. Exponential Linear Unit (ELU): Output range: (-α, +∞). Best usage: An alternative to ReLU, with smoother transitions for negative inputs.
7. Scaled Exponential Linear Unit (SELU): Output range: (-λα, +∞), roughly (-1.76, +∞) with the standard constants. Best usage: Introduced for maintaining mean and variance during training (self-normalization), works well for deeper architectures.
8. Softmax: Output range: (0, 1) for each element, with the sum being 1. Best usage: Output layer for multi-class classification tasks to obtain probabilities for each class.
9. Softplus: Output range: (0, +∞). Best usage: A smooth approximation of ReLU, often used in shallow networks or when smoothness is preferred over sparsity.
10. Swish: Output range: approximately [-0.28, +∞). Best usage: Introduced as a more effective alternative to ReLU, tends to perform well across a range of tasks, particularly in deeper models.
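
Several of these functions are short enough to write out and probe numerically. The sketch below uses plain Python with commonly used default parameters (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU), which are assumptions rather than values given in this card:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def softplus(x):
    return math.log1p(math.exp(x))


def swish(x):
    return x * sigmoid(x)


for name, fn in [("sigmoid", sigmoid), ("tanh", math.tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("elu", elu),
                 ("softplus", softplus), ("swish", swish)]:
    print(name, [round(fn(x), 4) for x in (-5.0, -1.0, 0.0, 1.0, 5.0)])
```

Evaluating each function at a few negative and positive inputs is a quick way to check its output range against the list.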

Answer 48

A technique used in machine learning to prevent overfitting and improve the generalization of models by adding a regularization term to the loss function. It combines the penalties of L1 (Lasso) and L2 (Ridge) regularization methods, allowing both feature selection and coefficient shrinkage. Elastic Net regularization is particularly useful when dealing with high-dimensional datasets where there are many features, as it helps to automatically select relevant features while shrinking coefficients towards zero to avoid model complexity.

Answer 49

Overfitting occurs when a machine learning model learns to perform well on the training data but fails to generalize to unseen data. Several techniques can help address overfitting:
1. Data-Focused Techniques
* Gather More Data.
* Data Augmentation: Increasing the diversity of the training data through techniques such as data augmentation helps the model learn more robust features.
* Feature Selection: Remove irrelevant or redundant features that might be causing your model to focus on spurious patterns.
2. Regularization
* Techniques like L1 and L2 regularization add penalty terms to the loss function, discouraging the model from learning overly complex patterns.
3. Model-Based Strategies
* Early Stopping: Monitoring the model's performance on a validation set during training and stopping when performance starts to degrade can prevent overfitting.
* Ensemble Methods: Combining predictions from multiple models (e.g., bagging, boosting) can reduce overfitting by leveraging the wisdom of crowds.
* Simpler Models: Start with less complex models (e.g., fewer layers, fewer neurons). They're less prone to memorize intricate patterns of the training data.
4. Other Approaches
* Cross-Validation: Trains multiple models on different splits of the data, helping in selecting hyperparameters and getting a more robust sense of the model's generalization.
* Hyperparameter Tuning: Experiment with learning rates, regularization strength, batch sizes, etc., to find settings that help prevent overfitting.

Answer 50

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data. It will have poor performance on both the training set and the validation set. Techniques that can help:
1. Increase Model Complexity
* More Layers/Neurons: Add layers or increase the number of neurons in existing layers in your neural network.
* More Complex Models: Switch to a model with higher inherent capacity.
* Reduce Regularization: If your regularization terms (L1, L2, dropout) are too strong, reduce their influence.
2. Feature Engineering
* Create New Features: Extract more informative features from your existing data.
* Reduce Noise: Remove irrelevant features or outliers in your data that might confuse the model.
3. Adjust Hyperparameters
* Experiment with different hyperparameter settings, such as learning rate, batch size, and network architecture, to find a configuration that reduces underfitting.
4. Train Longer
* Increase Epochs: Allow the model more passes over the training data to learn more complex patterns.
* Monitor: Use a validation set to check if performance continues to improve, and consider early stopping if not.
5. Increase Training Data
* Collect more training data to provide the model with more examples to learn from.

Answer 51

learning_rate: Controls the step size during gradient-based updates. Smaller values lead to slower, potentially more accurate training. depth: Maximum depth of each decision tree. Deeper trees can model more complex patterns but are prone to overfitting. iterations: The number of trees to train, i.e. the number of boosting iterations. More trees usually improve performance but increase training time. l2_leaf_reg: L2 regularization term to prevent overfitting. Bagging temperature: Controls the randomness in sampling during training.

Answer 52

max_depth: Maximum depth allowed for the tree. Restricts complexity and helps prevent overfitting. min_samples_split: Minimum samples needed to consider a split at a node. Makes the tree less sensitive to noise. criterion: Function to measure the quality of a split (e.g., "gini" for impurity, "entropy" for information gain). Minimum samples leaf: Minimum number of samples required to be at a leaf node.

Answer 53

alpha: Overall regularization strength. A higher alpha applies a stronger combined L1 and L2 penalty. l1_ratio: The mix between L1 and L2 regularization. Closer to 1 behaves like Lasso and promotes sparsity (feature selection); closer to 0 behaves like Ridge, shrinking coefficients without zeroing them and keeping groups of correlated features together. max_iter: Maximum number of iterations for the optimization algorithm. Tolerance: Convergence threshold for optimization.
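
To make the roles of `alpha` and `l1_ratio` concrete, here is a sketch of the penalty term in the parameterization used by scikit-learn's `ElasticNet`, namely alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2). The function name is illustrative and the data-fit part of the loss is omitted:

```python
def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Penalty term only: alpha * (l1_ratio * L1 + 0.5 * (1 - l1_ratio) * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)


weights = [1.0, -2.0, 0.0]
print(elastic_net_penalty(weights, l1_ratio=1.0))  # 3.0  (pure L1, Lasso-like)
print(elastic_net_penalty(weights, l1_ratio=0.0))  # 2.5  (pure L2, Ridge-like)
print(elastic_net_penalty(weights, l1_ratio=0.5))  # 2.75 (an even mix)
```

Doubling `alpha` doubles the whole penalty, while `l1_ratio` only changes how it is split between the two terms.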

Answer 54

learning_rate: Controls the contribution of each tree, promoting gradual learning and reducing overfitting. n_estimators: Number of trees in the ensemble. More trees generally improve accuracy. subsample: Fraction of data to sample for each tree, promoting diversity and reducing overfitting. Maximum depth: Maximum depth of the individual trees.

Answer 55

Initialization method: Method for initializing cluster centroids (e.g., random or K-means++). n_clusters: The "k" - the number of clusters you want the algorithm to find. Maximum iterations: Maximum number of iterations for convergence. Tolerance: Convergence threshold for termination.

Answer 56

n_neighbors: The "k" - number of neighbors to consider for classification or regression. weights: How to weight neighbors (e.g., 'uniform' for equal, 'distance' to give closer ones more influence). p: Parameter for the Minkowski distance metric. p=1 is Manhattan distance, p=2 is Euclidean distance. (other distances are Cosine Similarity, Jaccard Distance) Algorithm: Algorithm used to compute nearest neighbors (e.g., brute force or kd-tree).
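
The role of `p` can be checked with a direct implementation of the Minkowski distance. This is a plain-Python sketch with an illustrative function name:

```python
def minkowski(a, b, p=2):
    """Minkowski distance: (sum |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


origin, point = [0.0, 0.0], [3.0, 4.0]
print(minkowski(origin, point, p=1))  # Manhattan distance: 3 + 4
print(minkowski(origin, point, p=2))  # Euclidean distance: sqrt(9 + 16)
```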

Answer 57

learning_rate: Step size for updates. Lower values lead to slower but potentially more accurate training. num_leaves: Maximum number of leaves in each tree, controlling model complexity. max_depth: Limits tree depth, another way to control complexity and prevent overfitting. feature_fraction: Fraction of features randomly selected for each tree, promoting diversity.

Answer 58

Regularization: Type (L1, L2, or none) and strength (controlled by an alpha/lambda parameter) to prevent overfitting. Intercept: Whether to include an intercept term in the model. Solver: Optimization algorithm for fitting the model (e.g., ordinary least squares or gradient descent). Tolerance: Convergence threshold for optimization.

Answer 59

Regularization type: L1, L2, or none. C: Inverse regularization strength (smaller C means stronger regularization). solver: Optimization algorithm for fitting the model (e.g., Newton's method or stochastic gradient descent). Maximum iterations: Maximum number of iterations for optimization.

Answer 60

smoothing: Add pseudo-counts to avoid zero-probability issues (common with categorical variables). Distribution assumption: Assumption about the distribution of the features (e.g., Gaussian or multinomial). Prior probabilities: Prior probabilities of the classes (if not estimated from the data).

Answer 61

degree: Degree of the polynomial fit to the data. Higher degrees model more complex relationships. Regularization strength: Strength of L1 or L2 regularization (if used). Interaction terms: Whether to include interaction terms between features. Solver: Optimization algorithm for fitting the model (e.g., ordinary least squares or gradient descent).

Answer 62

n_estimators: Number of trees in the forest. More trees improve accuracy but slow down training and prediction. max_depth: Maximum tree depth, regulating complexity. min_samples_leaf: Minimum number of samples required in a leaf node; a split is only considered if it leaves at least this many samples in each branch. max_features: Number of features to consider at each split, introducing randomness. Criterion: Function to measure the quality of a split (e.g., Gini impurity or entropy).

Answer 63

kernel: Type of kernel ("linear", "rbf", "poly") used to map data into a higher-dimensional space and define the similarity measure. Model complexity depends on the kernel choice. C: Regularization parameter, controlling the trade-off between fitting the training data and keeping a smooth decision boundary (smaller C means stronger regularization). Kernel coefficient (gamma): Coefficient for non-linear kernel functions. Degree: Degree of the polynomial kernel function (if a polynomial kernel is used).

Answer 64

eta (equivalent to learning_rate): Controls the step size during boosting. gamma: Minimum loss reduction needed to create a new split, promoting conservative trees. Subsample: Fraction of samples used for fitting the individual trees. Number of estimators: Number of boosting iterations. Maximum depth: Maximum depth of the trees.

Answer 65

Machine learning models "learn" patterns from data to make predictions. Overfitting happens when a model becomes too complex and starts fitting the random noise in the training data instead of the true underlying pattern. L1 regularization (Lasso) adds a penalty term to the model's cost function (the thing it's trying to minimize). This penalty is proportional to the sum of the absolute values of the model's coefficients (weights). The L1 penalty encourages the model to shrink coefficients towards zero, and some coefficients become exactly zero. By setting coefficients to zero, L1 regularization effectively performs feature selection: it helps identify the most important features for making predictions. This leads to simpler models that are easier to interpret, and it helps prevent overfitting, improving the model's ability to perform well on new data.
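
One way to see why L1 produces exact zeros: for a single coefficient, minimizing 0.5 * (w - z)^2 + lam * |w| has the closed-form solution known as soft-thresholding. The sketch below applies it to some made-up unpenalized coefficients; it illustrates the effect and is not a full Lasso solver:

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) over w."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


unpenalized = [3.0, -0.4, 0.05, -2.0]  # hypothetical coefficients without a penalty
lasso_like = [soft_threshold(z, lam=0.5) for z in unpenalized]
print(lasso_like)  # [2.5, 0.0, 0.0, -1.5]
```

Coefficients smaller in magnitude than `lam` vanish, which is the feature-selection behaviour described above.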

Answer 66

A technique used in machine learning to prevent overfitting. Overfitting happens when a model becomes too complex and fits the noise in your training data rather than the underlying pattern. L2 regularization (Ridge) adds a penalty term to the model's cost function, which is based on the sum of the squares of the model's coefficients (weights). Instead of driving some coefficients to zero like L1 regularization, L2 regularization encourages them to shrink, making them smaller but not necessarily eliminating them. This leads to smoother models that are less likely to overfit. While L2 regularization doesn't perform explicit feature selection like L1, it can still help reduce model complexity. Because of the squaring, L2 penalizes large coefficients much more heavily than L1 does, while putting very little pressure on coefficients that are already small.
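
The shrink-but-not-eliminate behaviour can be seen in the one-coefficient case: minimizing 0.5 * (w - z)^2 + 0.5 * lam * w^2 gives w = z / (1 + lam). The coefficients below are made up, and this is only a sketch of the effect, not a full Ridge solver:

```python
def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2 over w."""
    return z / (1.0 + lam)


unpenalized = [3.0, -0.4, 0.05, -2.0]  # hypothetical coefficients without a penalty
ridge_like = [ridge_shrink(z, lam=0.5) for z in unpenalized]
print(ridge_like)  # every value is scaled by 1 / 1.5; none becomes exactly zero
```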

Answer 67

The learning rate is a crucial hyperparameter in machine learning that dictates how aggressively a model adjusts its parameters during training. Gradient descent, a common training algorithm, calculates the direction of steepest increase in error (the gradient). The learning rate controls how big of a step your model takes in the opposite direction of the gradient, aiming to descend towards the lowest point on the landscape, representing the minimum error. A too-small learning rate leads to slow, timid steps, making convergence take a long time or possibly causing the model to get trapped in local minima (small dips, but not the global minimum). Conversely, an overly large learning rate can result in reckless leaps, potentially causing the model to overshoot the valleys or oscillate without ever converging. Finding the right learning rate is tricky as there's no single magic number – it depends on your specific problem, dataset, and even where you are in the training process. Often, practitioners start with a reasonable default value (like 0.01) and experiment. Techniques like learning rate schedulers (which decrease the learning rate over time) and adaptive optimizers like Adam (which adjust learning rates somewhat automatically) can significantly simplify finding an effective learning rate.
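
The three regimes can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2 (gradient 2x); the function, starting point, step count, and the three learning rates are all chosen purely for illustration:

```python
def gradient_descent(lr, steps=50, start=5.0):
    """Minimize f(x) = x**2 by stepping against the gradient 2*x."""
    x = start
    for _ in range(steps):
        x -= lr * 2 * x  # step size is scaled by the learning rate
    return x


too_small = gradient_descent(lr=0.001)   # barely moves away from the start
reasonable = gradient_descent(lr=0.1)    # converges close to the minimum at 0
too_large = gradient_descent(lr=1.1)     # overshoots more on every step and diverges
print(too_small, reasonable, too_large)
```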

Build, Train and Tune Model Flashcards

(92 cards)