Models & Elements Flashcards

Question

K-Nearest Neighbours

Answer 1

A simple and intuitive machine learning algorithm used for classification and regression tasks. Given a new data point, KNN predicts its class label or numerical value based on the majority vote or average of its k nearest neighbors in the training dataset. KNN relies on the assumption that similar data points tend to belong to the same class or have similar target values. It is a non-parametric and lazy learning algorithm that does not require training a model explicitly.
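The majority-vote idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Euclidean distance and a made-up toy dataset; the names `knn_predict`, `points`, and `labels` are hypothetical and not part of the card.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point (Euclidean, an assumption).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # "Lazy" learning: nothing is fit in advance; all work happens at prediction time.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups (illustrative values only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
```

For regression, the vote would be replaced by the average of the neighbours' target values.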

Answer 2

A supervised learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the target variable. Linear regression searches for the line (or hyperplane in higher dimensions) that minimizes the sum of the squared residuals, that is, the squared differences between the observed target values and the values predicted by the line. This method is known as the method of least squares. Linear regression models are commonly used for prediction and inference tasks, and they provide interpretable coefficients that indicate the strength and direction of the relationships between variables.
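For a single input feature, the least-squares line has a closed-form solution. The sketch below assumes one feature and a tiny invented dataset; `fit_line`, `xs`, and `ys` are illustrative names only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums from the least-squares derivation.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The slope is the interpretable coefficient mentioned above: it says how much the prediction changes per unit change in the input.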

Answer 3

Traditional feed-forward neural networks struggle to handle data with a sequential nature (e.g., text, time series). RNNs address this by having a "memory" mechanism to retain information from previous steps. But long sequences make it hard for RNNs to learn long-range dependencies: information from earlier steps can fade away as it is propagated. LSTMs are an advanced type of RNN cell designed to mitigate the vanishing gradient problem. They do not just remember the last state; they can also decide which parts are worth remembering.
Key Components of an LSTM Unit:
1. Cell State: This is the "long-term memory" of the LSTM. It runs through the entire chain, with only minor interactions, keeping information flowing.
2. Gates: These are what make LSTMs special:
Forget Gate: Selectively decides what information from the previous cell state should be discarded.
Input Gate: Determines what new information from the current input should be added to the cell state.
Output Gate: Controls which parts of the updated cell state become part of the output.
How it Works (Simplified):
a) The forget gate looks at the previous hidden state and current input and decides what old information to keep.
b) The input gate looks at the same inputs and decides how much of a newly created "candidate" update should be written to the cell state.
c) The cell state is updated by combining parts of the old state (what the forget gate didn't discard) and the new candidate values.
d) The output gate selects relevant parts of the cell state to generate an output.
At their core, LSTMs are neural network layers with a complex internal structure. This includes the cell state and the three gates (forget, input, and output). The gates use sigmoid activation functions, while the candidate values and the output use the hyperbolic tangent. Each gate and the calculations for updating the cell state involve sets of weights and biases. These are just like the weights and biases found in other parts of a neural network.
LSTMs are trained as part of the overall neural network using the same principles of gradient descent and backpropagation.
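The gate mechanics can be sketched as one time step of a scalar LSTM cell. This is a simplified illustration that assumes a single input and a single hidden unit, so every weight is one number; the `params` layout and the name `lstm_step` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. `params[name]` is (input weight, hidden weight, bias)."""
    def layer(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = layer("forget", sigmoid)        # how much of the old cell state to keep
    i = layer("input", sigmoid)         # how much of the candidate to write
    g = layer("candidate", math.tanh)   # proposed new values
    o = layer("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g              # updated long-term memory
    h = o * math.tanh(c)                # new hidden state / output
    return h, c

# All-zero parameters make every gate output 0.5 and the candidate 0.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
```

In a real layer the scalars become weight matrices and the products become matrix-vector products, but the update rule keeps the same shape.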

Answer 4

A probabilistic machine learning model based on Bayes' theorem and the assumption of conditional independence between features. It calculates the probability of each class label given a set of input features and selects the class label with the highest probability as the predicted label for the input. Despite its simplicity and the naive assumption of feature independence, Naive Bayes classifiers are widely used for text classification, spam filtering, and other classification tasks.

Answer 5

NER (Named Entity Recognition) is a subfield of Natural Language Processing (NLP) focused on automatically identifying and classifying specific entities within a body of text. These are predefined categories like: People, Organizations (e.g., Google), Locations (e.g., France), Dates & Times (e.g., July 4th, 2023), Quantities (e.g., $1 Million), ... and even custom entity types for your specific application. Various ML algorithms are used for NER, including traditional ML methods like Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), and deep learning models like Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers. The NER model predicts the type of entity each word or group of words represents (or indicates that it's not an entity).
Why NER is important:
* NER extracts structured data from unstructured text, which unlocks many applications. It helps tasks like machine translation, question answering, and text summarization.
* Business applications include: customer support chatbots that identify key issues and people mentioned; analyzing legal documents to extract contract terms; monitoring news feeds for relevant company or market trends.
Challenges:
* Ambiguity: Words can belong to different categories depending on context (e.g., 'Apple' could be a company or a fruit).
* New Entities: Models need to be adaptable to handle previously unseen entities.
Typically, the output of an NER system might look like this:
Original Text: "John Doe visited Paris on July 4th, 2023 and met with the CEO of Acme Inc."
NER Output:
* John Doe (Person)
* Paris (Location)
* July 4th, 2023 (Date)
* Acme Inc. (Organization)

Answer 6

A neuron is the most basic processing unit within an artificial neural network. The concept of artificial neurons in neural networks is loosely inspired by biological neurons in the brain. Biological neurons receive signals (inputs) through connections called dendrites, process them, and send an output signal through the axon if a certain threshold is met. Neural networks learn by adjusting the weights and biases during training. The goal is to find the optimal values that produce the desired output given a specific input. Artificial neural networks are organized into layers: an input layer, one or more hidden layers, and an output layer. Neurons in one layer are connected to neurons in the next, creating a complex network of calculations.
In a neural network, a neuron is a mathematical function that performs the following:
1) Inputs: A neuron receives multiple input values. These inputs could come from raw data (e.g., pixel values of an image) or be the outputs of neurons from a previous layer in the neural network.
2) Weights: Each input is multiplied by a corresponding weight. Weights are like knobs that determine how much influence each input has on the neuron's output.
3) Summation: The weighted inputs are summed together.
4) Bias: A bias term is added to the sum. The bias is an adjustment that shifts how easily the neuron activates, independently of its inputs.
5) Activation Function: The result of the summation (and bias) is passed through a non-linear activation function. This function introduces non-linearity into the model, which is essential for neural networks to learn complex patterns. Common activation functions include: Sigmoid, Tanh, ReLU (Rectified Linear Unit).
6) Output: The output of the activation function is the final output of the neuron. This output can then be sent to neurons in the next layer of the neural network.
Simple Analogy: Imagine a neuron as a decision-maker. Consider the decision of whether to wear a coat outside:
1) Inputs: Temperature, wind speed, likelihood of rain.
2) Weights: How heavily you weigh each factor (you might care more about temperature than wind, etc.).
3) Bias: Your general predisposition towards wearing a coat (some people are more likely to get cold).
4) Activation Function: Your mental model deciding if the combined factors cross a threshold for putting on a coat.
5) Output: The decision: coat or no coat.
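The six steps of a neuron fit in a few lines. This is a minimal sketch assuming a sigmoid activation; the function name `neuron` and the example numbers are illustrative only.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    # Steps 2-4: multiply each input by its weight, sum, then add the bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 5: non-linear activation squashes z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example: the weighted inputs cancel out, so z = 0 and the output is 0.5.
balanced = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
# A positive bias pushes the same neuron towards activating.
biased = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=3.0)
```

In the coat analogy, the bias plays the role of the general predisposition: with identical inputs and weights, a larger bias yields an output closer to 1.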

Answer 7

Sometimes called debiasing. Real-world data often contains biases reflecting social prejudices or historical patterns of discrimination. ML models trained on this biased data learn and perpetuate these biases, resulting in unfair or harmful predictions. Neutralization is a collection of techniques aimed at reducing the influence of these unwanted biases in ML models.
Approaches to Neutralization:
Pre-processing: Modifying the training data to be more balanced or to remove sensitive attributes.
In-processing: Changing the model's training process, for example with regularization terms that penalize reliance on biased features, or adversarial learning setups where one part of the model tries to identify biases to help another part counteract them.
Post-processing: Adjusting model outputs to ensure fairness according to specific metrics.
Most importantly in NLP: Word embeddings, the vector representations of words that are foundational for many NLP tasks, can capture societal biases. For example, "doctor" might be closer to "man" and "nurse" closer to "woman" in the embedding space.
Debiasing Techniques in NLP:
Data Pre-processing:
Balanced Corpora: Curating datasets that have more balanced representation of different groups or perspectives.
Data Augmentation: Generating synthetic examples to counterbalance underrepresented groups or viewpoints.
Embedding Debiasing:
Geometric Techniques: Realigning word embeddings in the vector space to mitigate biased associations.
Contextualized Embeddings: Instead of static word vectors, using models like BERT that dynamically generate embeddings based on the surrounding sentence, which may reduce some forms of bias.
Model-level Adjustments:
Adversarial Training: Using a setup where one part of the model tries to predict a protected attribute (like gender) from the text, and the other part tries to perform the main task without relying on that protected attribute.
Fairness-aware Regularization: Adding terms to the loss function that penalize biased predictions across groups.

Answer 8

Used when you have a multiclass problem but only a binary classification algorithm. We create one classifier per class. In each, we keep the label of the target class and turn all other classes to zero. We build a model for each class (1, 2, 3, …) being the target class. We make the final prediction by putting the data through each model. With three classes we get 3 probabilities, and the highest wins.
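The relabel-train-compare loop can be sketched as follows. The binary model here is a deliberately crude stand-in (it compares distances to the centroids of the positive and negative examples), chosen only so the example is self-contained; any binary classifier that returns a score would fit. All function names and data are hypothetical.

```python
import math

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def train_binary(points, binary_labels):
    """Stand-in binary model: scores x higher when it is closer to the positives."""
    pos = centroid([p for p, y in zip(points, binary_labels) if y == 1])
    neg = centroid([p for p, y in zip(points, binary_labels) if y == 0])
    def score(x):
        return 1.0 / (1.0 + math.exp(math.dist(x, pos) - math.dist(x, neg)))
    return score

def train_one_vs_rest(points, labels):
    models = {}
    for target in sorted(set(labels)):
        # Keep the target class as 1 and turn every other class into 0.
        binary = [1 if y == target else 0 for y in labels]
        models[target] = train_binary(points, binary)
    return models

def predict_one_vs_rest(models, x):
    # One score per class; the highest wins.
    scores = {label: model(x) for label, model in models.items()}
    return max(scores, key=scores.get)

# Toy data: three separated classes (illustrative values only).
points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 10), (0, 11)]
labels = [1, 1, 2, 2, 3, 3]
models = train_one_vs_rest(points, labels)
```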

Answer 9

One-Class Classification (also called unary classification or class modeling) tries to identify objects of one specific class among all other objects, typically by learning from examples of that class only. These classifiers are used for outlier detection, anomaly detection, and novelty detection. Common examples are: one-class Gaussian, one-class k-means, one-class kNN, one-class SVM.

Answer 10

Pre-trained word embeddings are a foundational concept in Natural Language Processing (NLP). In essence, they map words and phrases from a human vocabulary to numerical vectors. These vectors aren't just random but are learned on massive text datasets in a way that captures semantic and syntactic relationships between words. Imagine words plotted in a high-dimensional space – pre-trained embeddings ensure that words like "cat" and "dog" are closer together than "cat" and "airplane". This allows NLP models to understand the context and nuances of language. Pre-trained models like Word2Vec or GloVe are a form of transfer learning, where knowledge learned from a massive text corpus can be leveraged to boost the performance of new NLP tasks, even on smaller datasets.

Answer 11

Pooling layers are like mini summarizers that shrink the size of data while keeping important features. Imagine looking at an image and identifying the overall shapes and edges, rather than getting hung up on every tiny detail. Pooling works similarly, by applying a window (often a 2x2 square) that slides across the data, summarizing the information within each window. There are different pooling operations, like averaging or taking the maximum value, to capture the essence of that local area. This reduction in size makes the data easier to manage for the network, reduces the number of calculations needed, and helps the network focus on broader patterns instead of getting bogged down in precise details that might not be critical for the task at hand. Imagine the feature map as a grid, and the pooling layer has a small sliding window (e.g., 2x2). As this window moves across the grid, it applies a specific operation like max pooling (taking the highest value within the window) or average pooling (calculating the mean). This process distills the most significant information from each region, decreasing the data size while keeping the crucial patterns. By doing this, pooling layers make the network computationally lighter, decrease the chance of overfitting (getting too fixated on minor details), and help the CNN focus on larger, more relevant features for its image recognition or classification tasks. A pooling layer usually follows a convolution layer.
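The sliding-window description corresponds to the sketch below, which assumes a 2x2 window moved with stride 2 (non-overlapping windows) over a small invented feature map. The name `max_pool_2x2` is illustrative.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 over a 2D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    # Step by 2 so windows do not overlap; a trailing odd row or column is dropped.
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
]
pooled = max_pool_2x2(feature_map)  # 4x4 input becomes a 2x2 output
```

Swapping `max(window)` for `sum(window) / 4` would give average pooling instead.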

Answer 12

Query by Committee (QBC) is used in active learning. We train multiple models and then ask an expert to label only those examples about which the models disagree the most. Imagine having a committee of undecided learners (algorithms) all trained on the same data set. When faced with a new, unlabeled data point, QBC doesn't ask a single learner for its guess. Instead, it pits the learners (two or more) against each other. If they disagree on how to classify the data point (because it falls in a confusing area for them), then QBC assumes this point holds valuable information for improving everyone's learning. This "disagreement" becomes the query, prompting the labeling of that specific data point. By focusing on points where learners are unsure, QBC efficiently selects the most informative data for labeling, ultimately leading to a better-trained committee (and all the individual learners within it).
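One common way to measure disagreement is the entropy of the committee's votes. The sketch below assumes that measure and uses three hypothetical threshold classifiers as the committee; none of these names come from the card.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's votes: 0 when unanimous, larger when split."""
    total = len(votes)
    return -sum((n / total) * math.log(n / total) for n in Counter(votes).values())

def select_query(committee, unlabeled):
    """Pick the unlabeled point on which the committee disagrees the most."""
    return max(unlabeled, key=lambda x: vote_entropy([model(x) for model in committee]))

def make_threshold_model(threshold):
    # Hypothetical committee member: predicts 1 when x exceeds its threshold.
    return lambda x: int(x > threshold)

committee = [make_threshold_model(t) for t in (0.3, 0.5, 0.7)]
unlabeled = [0.1, 0.6, 0.9]
query = select_query(committee, unlabeled)  # the point to send to the expert
```

Here the members agree on 0.1 and 0.9 but split on 0.6, so 0.6 is the point worth labeling.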

Answer 13

Random Forest: Builds multiple decision trees independently and in parallel. Final prediction is based on the majority vote (classification) or average (regression) across trees. AdaBoost: Creates a sequence of weak learners (often decision stumps), where each subsequent learner focuses on correcting the mistakes of the previous one. Final prediction is a weighted combination of the weak learners. Gradient Boost: Similar to AdaBoost, it trains a series of weak learners sequentially. Each learner aims to correct the residual errors of the previous ensemble. AdaBoost iteratively increases the weight of misclassified examples, forcing subsequent learners to focus on difficult cases, while Gradient Boost directly trains new learners to predict the residuals (errors) of the current ensemble, progressively refining the predictions. This makes Gradient Boost a more general algorithm that can optimize various loss functions, often leading to higher accuracy but also a greater susceptibility to overfitting compared to AdaBoost.

Answer 14

Type of semi-supervised learning. We build a model using labeled examples. Then we use that model to label unlabeled examples. If the confidence score of a predicted label meets the threshold, we add that example to the training set. Then we rebuild the model iteratively until the whole data set is labelled. Unfortunately, these models are often not very accurate.

Answer 15

A Sparsely Connected Layer (SCL) is a type of neural network layer where not every neuron is connected to every neuron in the previous layer. This contrasts with traditional fully-connected layers where all neurons have connections. Here's why SCLs matter: Reduced Complexity: Fewer connections mean smaller models, faster computation, and less memory requirement – ideal for resource-constrained scenarios like mobile devices. Potential for Overfitting Reduction: Sparse connections can act as a form of regularization, potentially preventing models from overfitting to the training data. Biological Inspiration: SCLs are loosely inspired by the brain, where neurons are not fully interconnected either. Finding Optimal Sparsity: A key challenge with SCLs is finding the right level of sparsity and the best connection patterns. This may involve techniques like pruning less important connections or using algorithms designed to discover optimal sparse structures. Overall, sparsely connected layers represent a promising area of research as they aim to improve the efficiency and robustness of neural networks.

Answer 16

K-Means clustering is an unsupervised learning algorithm used to partition data points into K clusters based on similarity or distance measures. It aims to minimize the within-cluster variance and assigns each data point to the nearest centroid. On the other hand, K-Nearest Neighbors is a supervised learning algorithm used for classification and regression tasks. It predicts the class or value of a data point by considering the majority class or average value of its K nearest neighbors in the feature space. While K-Means clustering is used for clustering and segmentation, KNN is used for classification and regression tasks.

Answer 17

High-dimensional vector representations of words or entities learned from textual data using techniques such as Word2Vec, GloVe, or FastText. Each embedding vector captures semantic and syntactic information about the corresponding word or entity, encoding its meaning and context in a dense vector space. Embedding vectors enable the representation of words as continuous-valued vectors, facilitating natural language processing tasks such as word similarity calculation, document classification, and named entity recognition.

Answer 18

In natural language processing (NLP), featurized representations (word embeddings) convert words into vectors of numbers that capture their meanings, relationships, and context. Semantic Similarity: Words with similar meanings have similar vectors, allowing models to understand relationships between them (e.g., "cat" and "dog" would be closer in representation than "cat" and "airplane"). Distributional Hypothesis: Words appearing in similar contexts tend to have related meanings. Co-occurrence Matrices: Track how frequently words appear together in a window of text. Words with similar co-occurrence patterns get similar vector representations. Neural Networks: Models like Word2Vec (skip-gram and CBOW) are trained to predict a word based on its surrounding context, or vice versa. The learned weights within these networks become the embeddings. As I understand it, there is no predefined list of features. Each word is cross-referenced with every other word, and we measure how often they appear next to each other or within the same window of a sentence. The number we get is the strength of a feature. The t-SNE algorithm can be used to visualize the resulting vectors in 2D.

Answer 19

Traditional Recurrent Neural Networks (RNNs) can struggle with long-term dependencies: they have trouble retaining and using information from many timesteps in the past. This is closely tied to the vanishing/exploding gradient problems during training. In a standard RNN, the hidden state at each timestep is calculated by overwriting the entire previous hidden state with new information, which makes it hard to control what is retained or forgotten over multiple timesteps. Important past information might decay too quickly, while less relevant information could get amplified disproportionately.
Gated RNNs (GRNNs) introduce "gates", mechanisms that control the flow of information within their units. These gates help them selectively remember or forget information, improving their ability to handle long-term dependencies.
Selective cell update: gates introduce a fine-grained control mechanism.
Forget/Reset Gate: Allows the network to explicitly erase irrelevant information from the cell state, preventing it from cluttering the memory.
Update Gate: Decides how much of the new input (combined with the past hidden state) should update the cell state, promoting the retention of relevant new information.
Two common types of GRNNs exist, each with a slightly different set of gates:
Long Short-Term Memory (LSTM)
Forget Gate: Decides which information from the previous cell state to discard.
Input Gate: Decides how much new information to write into the cell state.
Output Gate: Controls how much of the internal cell state to expose as output.
Gated Recurrent Unit (GRU)
Reset Gate: Decides how much of the past hidden state to ignore when computing the new candidate state.
Update Gate: Plays the combined role of the LSTM's forget and input gates, for slightly simpler computation.
Gated RNNs have been widely used in natural language processing tasks, speech recognition, time series analysis, and sequential data modeling, where capturing temporal dependencies is crucial for accurate predictions.

Answer 20

Gradient boosting is a machine learning ensemble method used for regression and classification tasks. It builds a predictive model by sequentially training weak learners (e.g., decision trees) to correct the errors of the previous models. In each iteration, the algorithm fits a new model to the residual errors of the current ensemble and updates the ensemble by adding the new model with a scaled learning rate. Gradient boosting algorithms, such as XGBoost and LightGBM, are known for their high predictive accuracy and robustness.

Answer 21

An unsupervised machine learning algorithm used for partitioning data into k distinct clusters based on similarity or proximity of data points. The algorithm iteratively assigns data points to the nearest cluster centroid and updates the centroids to minimize the within-cluster sum of squared distances. K-means clustering is widely used for clustering analysis, data segmentation, and pattern recognition tasks.
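The assign-then-update loop can be sketched as below. This version assumes the initial centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding scheme) and runs a fixed number of iterations; `kmeans` and the toy data are illustrative.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Toy data: two obvious groups (illustrative values only).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Each update step moves a centroid to the mean of its cluster, which is exactly the point that minimizes that cluster's sum of squared distances.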

Answer 22

Logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression. At a time when the absence of computers required scientists to perform manual calculations, they were eager to find a linear classification model. They figured out that if we define a negative label as 0 and the positive label as 1, we would just need to find a simple continuous function whose codomain is (0,1). In such a case, if the value returned by the model for input x is closer to 0, then we assign a negative label to x; otherwise, the example is labeled as positive. One function that has such a property is the standard logistic function (also known as the sigmoid function): $f(x) = \frac{1}{1 + e^{-x}}$.

Answer 23

Polynomial regression is a form of linear regression where the relationship between the independent variable x and the dependent variable y is modeled as an n-degree polynomial function. The polynomial regression model is expressed as: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon$, where n is the degree of the polynomial, the $\beta$ coefficients are the regression parameters, and $\epsilon$ represents the error term. Polynomial regression allows for capturing non-linear relationships between variables and is useful when the relationship is not well represented by a straight line.

Answer 24

A time series is a sequence of data points collected or recorded at successive time intervals. Time series data is often used to analyze and forecast trends, patterns, and behaviors over time. A time series model learns patterns and relationships from past observations and uses them to forecast future values. Various machine learning techniques can be applied to time series data, including traditional statistical methods (e.g., ARIMA, Exponential Smoothing), classical machine learning algorithms (e.g., Support Vector Machines, Random Forests), and deep learning models (e.g., Recurrent Neural Networks, Long Short-Term Memory networks). These models leverage the temporal dependencies present in the data to make accurate predictions, which are evaluated based on metrics such as mean squared error, mean absolute error, or forecast accuracy.

Answer 25

Simple and relatively low-performing models or algorithms that perform only slightly better than random chance on a given learning task. In ensemble learning, weak learners are often combined to form a strong learner that achieves better predictive accuracy than any individual weak learner. Examples of weak learners include decision stumps (simple decision trees with only one split), perceptrons, and shallow neural networks.

Answer 26

Weights are usually initialized with random values at the start of training. Training is all about finding the right weights. Weights essentially represent the knowledge the model has learned from the training data. Weights are the adjustable parameters within a machine learning model that fundamentally determine how it learns and makes predictions. Weights are parameters that are assigned to the features (input variables) of a model during the training process. These weights determine the importance of each feature in making predictions. The model uses these weights to combine the features and produce an output. In a neural network, a higher weight means a stronger influence of one neuron on another. The weighted sum is then passed through an activation function (e.g., sigmoid, ReLU). This function adds nonlinearity, which is crucial for a neural network to learn complex patterns. It introduces a threshold-like behavior where the neuron "fires" (outputs a significant value) only if the weighted sum is large enough. The weights are continuously adjusted in response to the data patterns, with the goal of minimizing errors in the model's predictions. The process of adjusting the weights to minimize the difference between predicted and actual outputs is typically done through optimization algorithms such as gradient descent. Think of a recipe where the ingredients are your input data. To make a tasty dish: Weights are like the amounts of each ingredient. The cooking process is like the neural network calculations. A chef learns by tasting the results (the errors) and adjusting the ingredient quantities (the weights) to get the perfect flavor. Weights act as knobs that the model "tweaks" during the learning process to find the optimal mapping between inputs and outputs. Backpropagation and algorithms like gradient descent are the driving force behind adjusting weights to optimize a machine learning model.

Answer 27

There are pretrained word vectors on the internet which we can download and use for context and analogies.

Answer 28

Extreme Gradient Boosting is an optimized and scalable implementation of the gradient boosting algorithm, a popular ensemble learning method. XGBoost builds a strong ensemble of decision trees sequentially by minimizing a differentiable loss function using gradient descent optimization. It incorporates several regularization techniques to prevent overfitting and improve generalization performance. XGBoost is widely used for classification, regression, and ranking tasks and has won numerous machine learning competitions for its high predictive accuracy and efficiency.

Answer 29

Kernels (in CNNs): Purpose: Kernels are small matrices used in convolutional layers to detect patterns and extract features. They slide over the input data, performing calculations that highlight specific characteristics (edges, textures, shapes, etc.). Resizing: Kernels themselves don't directly resize the input. The output feature maps might have the same or different dimensions from the input, depending on factors like stride (the step size of the kernel movement) and padding. Changing Values: Kernels change the values in the output feature map by emphasizing patterns. A kernel designed to find edges might produce high values where edges are present, and low values elsewhere. Pooling Layers: Purpose: Pooling layers are designed to downsample feature maps, reducing their spatial size. This makes the network more computationally efficient and helps prevent overfitting. Resizing: Pooling layers explicitly resize the input by summarizing regions into smaller representations. Changing Values: While pooling might change specific values due to the downsampling calculation (e.g., max value, average), its primary focus is on reducing dimensionality, not modifying value patterns like kernels do. In Summary: Kernels detect and enhance features, changing values to make the patterns more pronounced. Pooling layers reduce the size of feature maps, making things computationally easier while preserving the most important information.
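How a kernel changes values can be seen in a tiny sliding-window example. The sketch assumes stride 1 and no padding, and it computes the sliding dot product that CNN layers use (technically cross-correlation). The image and the hand-picked `vertical_edge` kernel are invented for illustration; in a trained CNN the kernel values are learned.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Element-wise product of the kernel and the image patch, then sum.
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k)
            ))
        output.append(row)
    return output

# A dark-to-bright vertical edge between the second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right
feature_map = convolve2d(image, vertical_edge)
```

The output is high only in the column where the edge sits, and its size (3x3 from a 4x4 input) follows from the kernel size, stride, and lack of padding rather than from any deliberate downsampling.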

Answer 30

A type of deep learning model architecture primarily used in natural language processing (NLP) tasks. They are based on self-attention mechanisms that allow the model to weigh the importance of different input tokens when generating output representations. Transformers have achieved state-of-the-art performance in various NLP tasks, including language translation, text generation, and sentiment analysis, and are known for their parallelizability and scalability.

Answer 31

A component or subset of a larger predictive model that focuses on modeling a specific aspect or subset of the data. In machine learning, submodels are often used within ensemble methods, hierarchical models, or modular architectures to divide the modeling task into smaller, more manageable parts. Submodels can be trained independently or jointly with other components of the model and combined to make predictions or perform inference on the entire dataset.

Answer 32

A supervised learning algorithm used for classification and regression tasks. SVMs work by finding the optimal hyperplane that separates classes in the feature space while maximizing the margin between the classes. In classification, SVM aims to find the hyperplane that best separates data points into different classes, while in regression, it aims to find the hyperplane that best fits the data. SVMs are effective for high-dimensional data and can handle both linear and non-linear relationships using kernel methods. This is known as the kernel trick: a kernel function computes inner products in a higher-dimensional feature space without explicitly mapping the data into that space.

Answer 33

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm used for visualizing high-dimensional data in a lower-dimensional space, typically 2D or 3D. It is particularly useful for exploratory data analysis, dimensionality reduction, and revealing cluster structure.

Answer 34

A type of artificial neural network designed to process sequential data by maintaining an internal state (memory) that captures information from previous time steps. RNNs are characterized by feedback connections that allow information to persist and flow through the network over time. They are well-suited for tasks such as time series prediction, natural language processing, and speech recognition, where context and temporal dependencies are important.

Answer 35

They are a type of information filtering system that suggests items (products, movies, articles, etc.) or content that a user is likely to find relevant or interesting. They analyze past user behavior, preferences, and item characteristics to predict what a user might enjoy. Recommendation systems can be framed as different machine learning problem types: Classification: Predicting whether a user will like or dislike a specific item (e.g., thumbs up or thumbs down). Regression: Estimating a numerical rating a user might give to an item (e.g., on a 5-star scale). Ranking: Generating an ordered list of items most likely to be relevant to the user. ML Techniques: Decision Trees, Random Forests, Neural Networks, Nearest Neighbour, Matrix Factorization.

Answer 36

An ensemble learning method used for classification and regression tasks. It constructs multiple decision trees during training and combines their predictions through averaging (for regression) or voting (for classification) to improve predictive accuracy and robustness. Each decision tree in the Random Forest is trained on a bootstrap sample of the original dataset, and random subsets of features are considered at each split. Random Forests are known for their high accuracy, scalability, and resistance to overfitting.

Answer 37

A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. When a leaf node is reached, the decision is made about the class to which the example belongs. Decision tree learning is a supervised learning algorithm used for classification and regression tasks. It recursively partitions the input space into regions based on feature values, creating a tree-like structure where each internal node represents a decision based on a feature, and each leaf node represents a class label or regression value. Decision trees are intuitive, interpretable, and can capture complex decision boundaries in the data.

Answer 38

A type of artificial neural network designed for processing structured grid data, such as images or spatial data. Convolutional networks leverage convolutional layers, pooling layers, and fully connected layers to learn hierarchical representations of input data. They are widely used in computer vision tasks, including image classification, object detection, and image segmentation. Compared with fully connected networks, they significantly reduce the number of parameters without sacrificing too much quality, because each filter's weights are shared across all positions of the input. The same filter therefore recognises the same local pattern wherever it appears.
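To make the parameter saving concrete, here is a rough count for one layer, assuming a 32x32 RGB input and 16 output channels. The layer sizes are chosen only for illustration.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights of a square-kernel convolution plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_params(n_inputs, n_outputs):
    """Weights of a fully connected layer plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

# A 3x3 convolution mapping 3 channels to 16 channels.
conv_count = conv_params(kernel_size=3, in_channels=3, out_channels=16)

# A dense layer producing an output of the same size (32*32*16 values)
# from the flattened 32*32*3 input.
dense_count = dense_params(n_inputs=32 * 32 * 3, n_outputs=32 * 32 * 16)
```

Here conv_count is 448 while dense_count is 50,348,032, which is the effect of sharing each filter's weights across positions.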

Answer 39

Conditioning signal refers to additional information provided to the generator or discriminator to guide the generation process. Conditioning signals can include class labels, attributes, or other latent variables that influence the generation of realistic samples. Conditioning signals enable GANs to generate diverse and controllable outputs tailored to specific attributes or characteristics desired by the user.
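One common way to supply a class label as a conditioning signal is to append its one-hot encoding to the generator's noise vector. The sketch below shows only that input construction; the vector sizes and function names are assumptions, and no generator network is included.

```python
import random

def one_hot(label, n_classes):
    """Encode a class index as a 0/1 vector."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]

def conditioned_generator_input(noise, label, n_classes):
    """Concatenate the noise vector with the one-hot conditioning signal."""
    return noise + one_hot(label, n_classes)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
g_input = conditioned_generator_input(noise, label=2, n_classes=3)
```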

Answer 40

Bag of Words (BoW) is a simple and commonly used technique in natural language processing (NLP) for text representation. It involves treating a document as a collection of words or tokens and representing it as a sparse vector where each dimension corresponds to a unique word in the vocabulary, and the value represents the frequency of the word in the document. Bag of Words disregards word order and most semantic information (for example, "dog bites man" and "man bites dog" get identical vectors, and synonyms are treated as unrelated words) but is effective for tasks such as text classification, document clustering, and information retrieval.
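A minimal sketch of the representation, assuming whitespace tokenization and lowercasing (real pipelines usually do more). The example documents are made up.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of the unique tokens across all documents."""
    return sorted({token for doc in docs for token in doc.lower().split()})

def bag_of_words(doc, vocab):
    """Count vector with one dimension per vocabulary word."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

docs = ["dog bites man", "man bites dog", "dog eats dog food"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

The first two documents get the same vector even though their meanings differ, which is the loss of word order described above.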

Answer 41

A machine learning ensemble method used for classification and regression tasks. It combines multiple weak learners (e.g., decision trees) into a strong learner by sequentially training each learner on a modified version of the dataset. AdaBoost assigns higher weights to misclassified data points in each iteration, forcing subsequent learners to focus on difficult examples. The final prediction is a weighted combination of individual learner predictions, where more accurate learners have higher influence.
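The reweighting step can be sketched for binary labels in {-1, +1}, using the standard AdaBoost formulas for the learner weight alpha and the example weights. The labels and predictions below are made up, and fitting the weak learner is not shown.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round. Assumes 0 < weighted error < 1.

    Returns (alpha, new_weights): alpha is the learner's say in the final
    vote, new_weights are the normalized example weights for the next round.
    """
    error = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # y * h is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct examples scaled down.
    unnormalized = [
        w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)
    ]
    total = sum(unnormalized)
    return alpha, [w / total for w in unnormalized]

weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last example is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After this round the single misclassified example carries about half of the total weight, so the next learner is pushed to get it right.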

Answer 42

Anomaly detection algorithms aim to distinguish between normal data points and anomalous ones by learning the typical characteristics of the data distribution. This is typically achieved through unsupervised learning techniques, as anomalies often lack labeled examples. Common approaches to anomaly detection include statistical methods, clustering algorithms, and dedicated unsupervised or semi-supervised techniques such as isolation forests and one-class support vector machines (SVMs). Anomaly detection finds applications in various domains, including fraud detection, network security, system health monitoring, and industrial quality control, where identifying unusual patterns or behaviors is critical for maintaining integrity and reliability.
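As an example of the statistical approach, a z-score rule flags points that lie far from the mean in units of standard deviation. The readings and the threshold of 2.0 are assumptions. With very small samples a z-score cannot get large, so a threshold of 3 would flag nothing here.

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Return the indices whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one obvious outlier at the end.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0]
flagged = zscore_anomalies(readings, threshold=2.0)
```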

Answer 43

CycleGAN extends the GAN framework with a cycle consistency constraint to tackle image-to-image translation with unpaired data. A standard GAN learns a single generator for one domain, and earlier image-to-image translation methods (conditional GANs such as pix2pix) depend on paired examples for training. CycleGAN tackles the challenge of image-to-image translation without needing perfectly paired training data (like photos and their corresponding paintings). It trains two generators together, each pitted against its own discriminator. One generator translates images from a source domain (e.g., real photos) to a target domain (e.g., artistic style). The other does the reverse, translating from target back to source. Here's the twist: CycleGAN doesn't just train each model in isolation. It introduces a cycle consistency check. If an image from the source domain is transformed to the target style and then back again, it should ideally return close to the original image. This cycle consistency, enforced by a loss function, ensures the transformations are meaningful and maintain the content of the image while applying the new style. By working together and constantly checking each other's work, these generative models can learn to translate images between domains remarkably well, even without perfectly matched training data sets. Having two generators (one for each direction) and discriminators on both sides stabilizes training considerably. Paired vs. Unpaired Data: Traditional paired translation (e.g., pix2pix): Relies on paired training data, where you have examples from both your source and target domains in direct correspondence (e.g., a photo and its matching Monet-style painting). CycleGAN: Designed to work with unpaired data. You simply need sets of images from each domain, but they don't need to be direct translations of each other.
Focus of Learning: Traditional GAN: Primarily focuses on learning a single generative model that can realistically create images from a given domain. CycleGAN: CycleGAN involves two generators working in tandem. Each learns a mapping between the source and target domain, with the added cycle consistency objective enforcing this bidirectional mapping to be meaningful. Training Mechanism: Traditional GAN: The adversarial game is between a single generator and a discriminator. The discriminator tries to distinguish between real and generated images, while the generator aims to fool the discriminator. CycleGAN: CycleGAN introduces additional loss terms beyond simple adversarial loss. The key addition is the cycle consistency loss, which ensures that translating an image from domain A to B, and then back to A, brings you close to the original image.
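The cycle consistency loss can be illustrated with toy "generators" that are plain functions on lists of numbers instead of neural networks. Everything below (the functions g, f_good, f_bad and the data) is a made-up stand-in, and the adversarial losses are omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(x_batch, y_batch, g, f):
    """L1 cycle loss: x -> g -> f should return x, and y -> f -> g should return y."""
    forward = sum(l1_distance(f(g(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(g(f(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward

def g(x):        # stand-in for the generator A -> B
    return [2.0 * v + 1.0 for v in x]

def f_good(y):   # exact inverse of g, so cycles reconstruct perfectly
    return [(v - 1.0) / 2.0 for v in y]

def f_bad(y):    # not an inverse of g, so cycles drift
    return [v / 2.0 for v in y]

domain_a = [[0.0, 1.0, 2.0]]
domain_b = [[1.0, 3.0, 5.0]]
good_loss = cycle_consistency_loss(domain_a, domain_b, g, f_good)
bad_loss = cycle_consistency_loss(domain_a, domain_b, g, f_bad)
```

A pair of mappings that undo each other gets zero cycle loss, while a mismatched pair is penalized. That penalty is the pressure that keeps image content intact during translation.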

Answer 44

Analogy reasoning in natural language processing (NLP) involves understanding and solving analogical relationships between words or concepts (e.g., "man is to king as woman is to ?"). It requires recognizing semantic similarities and relationships between pairs of words and extending these relationships to find appropriate analogies. Analogy reasoning tasks are common in word embedding models and are used to evaluate their ability to capture semantic relationships between words.
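With word embeddings, an analogy is commonly solved by vector arithmetic followed by a nearest-neighbour search under cosine similarity. The two-dimensional vectors below are invented for the example. Real embeddings have hundreds of learned dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The query words themselves are excluded from the candidates.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Toy embeddings: first axis is roughly "royalty", second roughly "gender".
embeddings = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}
answer = solve_analogy("man", "king", "woman", embeddings)
```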

Answer 45

ARIMA (AutoRegressive Integrated Moving Average) is a popular class of time series models used for forecasting and analyzing time-dependent data. It combines autoregressive (AR), differencing (I), and moving average (MA) components: differencing removes trend, while the AR and MA terms capture the remaining autocorrelation. Plain ARIMA does not model seasonality; the seasonal extension (SARIMA) adds seasonal AR, differencing, and MA terms for that. ARIMA models are widely employed in finance, economics, and other fields for tasks such as stock price prediction, demand forecasting, and anomaly detection.
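The "I" step can be shown with the simplest member of the family, ARIMA(0,1,0) with drift (a random walk with drift): difference the series once, then forecast by adding the average difference back. This sketch has no AR or MA terms and the series is made up. A real analysis would use a fitted model from a statistics library.

```python
def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def forecast_random_walk_with_drift(series, steps=1):
    """Forecast by repeatedly adding the mean first difference (the drift)."""
    diffs = difference(series)
    drift = sum(diffs) / len(diffs)
    forecasts = []
    last = series[-1]
    for _ in range(steps):
        last = last + drift
        forecasts.append(last)
    return forecasts

series = [3.0, 5.0, 7.0, 9.0, 11.0]  # linear trend
diffs = difference(series)            # trend removed: constant differences
forecasts = forecast_random_walk_with_drift(series, steps=2)
```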

Answer 46

Architecture-Related: Number of Convolutional Layers; Filter Size; Stride; Number of Filters; Pooling Type and Size. Training-Specific: Learning Rate; Batch Size; Optimizer (the choice of algorithm, e.g., Adam, SGD with momentum, RMSprop, affects how the network updates its weights); Regularization. Data-Related: Input Image Size; Data Augmentation (techniques to increase data variability, such as rotations, flipping, and noise, improve robustness and help prevent overfitting).

Answer 47

In neural networks, Dense is a type of layer also known as a fully connected layer. In a dense layer, each neuron or node is connected to every neuron in the previous layer, forming a dense matrix of connections. Dense layers can be used at various positions within a neural network architecture, depending on the specific task and architecture design. However, they are most commonly found towards the end of the network, especially in architectures designed for tasks such as classification or regression. Placing dense layers at the end of a network lets them combine all the features extracted earlier in the network. Dense layers can increase the risk of overfitting, especially when dealing with complex datasets or architectures with a large number of parameters. They have the capacity to learn intricate patterns in the training data, including noise, which may not generalize well to unseen data. Dropout is a regularization technique commonly used to mitigate overfitting in neural networks, including those with dense layers.
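A dense layer followed by dropout can be sketched in plain Python. The weights are hand-picked, and the dropout shown is the common "inverted" variant, which rescales the kept activations during training so that nothing changes at inference time. That variant is an assumption, not something the card specifies.

```python
import random

def dense(x, weights, biases):
    """Fully connected layer: every output sees every input (no activation)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [[1.0, 2.0], [3.0, 4.0]]  # one row of weights per output neuron
biases = [0.5, -0.5]
hidden = dense([1.0, 1.0], weights, biases)
train_out = dropout(hidden, rate=0.5, rng=random.Random(0))
eval_out = dropout(hidden, rate=0.5, rng=random.Random(0), training=False)
```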

Answer 48

Early stopping is a regularization technique used during the training of machine learning models to prevent overfitting. Imagine training your model as baking a cake. You want it to bake long enough to be done, but not so long that it burns. Overfitting is like burning the cake: the model learns the training data too specifically, including noise, and performs worse on new data. Early stopping acts like a timer: it monitors the model's performance on a separate validation set. If performance starts to worsen, training is stopped, preserving the best version of the model before it starts to overfit. Here's how it works: Monitoring Validation Performance: In addition to the data used for training, a separate validation dataset is used. The model's performance on this validation set is monitored after each training epoch. Detecting Worsening Performance: If the model's performance on the validation set starts to degrade (e.g., error starts increasing), it's a signal that overfitting is likely beginning. Halting Training: Early stopping terminates the training process when this degradation is detected, even if the model could potentially keep improving on the training data. Key Idea: The goal is to preserve the model's state at the point where it generalizes best to unseen data, avoiding the over-specialization that leads to overfitting.
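The monitoring loop can be sketched over a list of per-epoch validation losses. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is a common practical addition assumed here, and the loss values are made up.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; the model from best_epoch is the one to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # never triggered: ran to the end

# Validation loss improves, then starts rising as overfitting sets in.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping(losses, patience=2)
```

In a real training loop the model weights would be checkpointed whenever best_epoch is updated and restored after stopping.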

Answer 49

A feedforward network (FNN) is a type of artificial neural network where connections between nodes do not form cycles (i.e., no feedback connections); the multilayer perceptron (MLP) is the classic fully connected example. Unlike in recurrent neural networks (RNNs), information flows in a single, forward direction from input nodes through hidden layers to output nodes, without any loops or recurrent connections. Feedforward networks are versatile and can be used for various machine learning tasks, including classification, regression, and function approximation. CNNs are also feedforward networks. Input Layer: Input features are fed into the input layer neurons. Within each subsequent layer, neurons perform calculations: they take a weighted sum of inputs from the previous layer and apply an activation function (non-linearity), which lets the network learn patterns that a purely linear model cannot. Hidden Layers: Input signals are propagated forward through one or more hidden layers, where each neuron applies a weighted sum of inputs and an activation function to produce an output. Output Layer: The output of the last hidden layer is passed to the output layer, where the final predictions are computed. FNNs learn through backpropagation: During training, the errors at the output are used to calculate gradients. These gradients are propagated backwards through the layers to update the weights, making the network better at its task.
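The forward pass can be traced on a tiny network with one hidden layer that computes XOR. The weights are hand-picked for the example, not learned, and backpropagation is not shown.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def layer(inputs, weights, biases, activation):
    """Weighted sum of the inputs plus bias, then the activation, per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-picked parameters: two hidden ReLU neurons, one linear output neuron.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -2.0]]
OUTPUT_B = [0.0]

def forward(x):
    """Input layer -> hidden layer -> output layer, strictly forward."""
    hidden = layer(x, HIDDEN_W, HIDDEN_B, relu)
    return layer(hidden, OUTPUT_W, OUTPUT_B, identity)[0]

outputs = [forward(x) for x in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0])]
```

XOR is not linearly separable, so removing the ReLU would make this mapping impossible. That is the role of the non-linearity mentioned above.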

Answer 50

Gradient Boosting and XGBoost are both machine learning techniques used for supervised learning tasks, particularly for regression and classification. XGBoost is like gradient boosting on steroids, optimized for performance and scalability. It's become a go-to algorithm for many structured data problems. Gradient Boosting: Gradient Boosting is an ensemble learning technique where weak learners, typically decision trees, are trained sequentially, and each subsequent model corrects the errors made by the previous one. It optimizes a loss function by minimizing the residual errors at each step, using gradient descent. XGBoost: XGBoost (eXtreme Gradient Boosting) is a specific implementation of gradient boosting that is optimized for speed and performance. It includes several enhancements over traditional gradient boosting, such as a regularization term to control overfitting, a more efficient algorithm for splitting nodes, and support for parallel and distributed computing. XGBoost is known for its high accuracy and efficiency and has become a popular choice for competitions on platforms like Kaggle. Key Enhancements - Regularization: XGBoost heavily uses regularization to prevent overfitting: a)Penalties on model complexity (e.g., number of leaf nodes in trees) b) Shrinkage (scales down the contribution of each tree) - Efficient Tree Building: It introduces optimizations in the way it finds the best splits for trees: a) Approximate greedy algorithm for split finding b) Sparsity awareness (handling missing values effectively) - Parallel & Hardware Optimized: a) Parallelizes the tree construction process. b) Designed for efficient use of computer hardware (CPU cache awareness) - Second-Order Gradients: While regular gradient boosting uses first-order gradients, XGBoost utilizes second-order gradients to provide more information for its weight update process. XGBoost Advantages: XGBoost is often significantly faster than traditional gradient boosting implementations. 
Due to its regularization and optimizations, it often matches or outperforms other gradient boosting implementations in terms of accuracy, although results depend on the dataset and tuning. Its computational efficiency makes it well-suited for large datasets. When Gradient Boosting Might Still Be Better: If you need highly interpretable models, simpler gradient boosting implementations can be easier to understand. It is also sensible to use them on smaller datasets, where the overhead of XGBoost's optimizations might not be worth it.
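The core loop of plain gradient boosting for squared error (fit each new weak learner to the current residuals, then add a shrunken copy of it to the ensemble) can be sketched with one-split regression stumps. This is not XGBoost: there is no regularization term, no second-order information, and the data and names are invented.

```python
def fit_stump(xs, targets):
    """Best single-threshold regressor on 1-D inputs: (threshold, left, right)."""
    best = None
    for t in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially fit stumps to residuals (the negative gradient of squared error)."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrinkage: only a fraction of each stump's correction is applied.
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
model = gradient_boost(xs, ys)
base_mse = mse(ys, [model[0]] * len(ys))
final_mse = mse(ys, [boosted_predict(model, x) for x in xs])
```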

Answer 51

Hidden Markov Models (HMMs) are statistical models used to describe sequences of observable events generated by underlying hidden states. In an HMM, the observed events form a sequence, while the hidden states represent the underlying, unobservable process that generates the observations. HMMs are characterized by two main components: transition probabilities, which describe the probability of transitioning between hidden states, and emission probabilities, which describe the probability of observing a particular event given a hidden state. HMMs are widely used in various applications such as speech recognition, natural language processing, bioinformatics, and time series analysis, where sequential data is prevalent, and the underlying structure is not directly observable. Hidden States: A sequence of underlying states the system moves through, but these states are not directly observable. Observations: At each timestep, you get an observation that depends on the current hidden state, but there's some probability involved. Real-World Examples Speech Recognition: The underlying hidden states are the phonemes or words being spoken, the observations are the noisy sound recordings. DNA Analysis: Hidden states represent different regions of DNA (coding, non-coding), observations are the sequences of letters (A, T, C, G). Stock Market Modeling: Hidden states represent market conditions (bull, bear), observations are the daily stock prices. Key Components of an HMM HMM are associated with: Hidden States: A set of possible states the system can be in. Observations: A set of possible symbols that can be observed. Transition Probabilities: The probability of moving from one hidden state to another. Emission Probabilities: The probability of observing a particular symbol in a given hidden state. Initial State Probabilities: The probability of the system starting in a particular state. 
How HMMs are Used Decoding: Given a sequence of observations, figuring out the most likely sequence of hidden states that generated it (e.g., figuring out the spoken words from sound recordings). Prediction: Predicting the likelihood of future observations based on previous ones. Learning: Adjusting the transition and emission probabilities to better fit observed data.
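Decoding is usually done with the Viterbi algorithm, a dynamic program over the transition and emission probabilities. The sketch below reuses the bull/bear market example from above, but every probability in it is invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state sequence, its joint probability)."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            prob, origin = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][obs], origin)
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, probability

states = ["bull", "bear"]
start_p = {"bull": 0.6, "bear": 0.4}                       # initial state probabilities
trans_p = {"bull": {"bull": 0.7, "bear": 0.3},             # transition probabilities
           "bear": {"bull": 0.4, "bear": 0.6}}
emit_p = {"bull": {"up": 0.5, "flat": 0.4, "down": 0.1},   # emission probabilities
          "bear": {"up": 0.1, "flat": 0.3, "down": 0.6}}

path, probability = viterbi(["up", "flat", "down"], states, start_p, trans_p, emit_p)
```

For long sequences, implementations normally work with log-probabilities to avoid numerical underflow.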

Answer 52

Standard softmax is often used in neural networks for classification, but when you have a huge number of potential output classes (like a massive vocabulary), it becomes computationally expensive. Calculating probabilities involves a big calculation over every single class. Hierarchical softmax tackles this by introducing a clever structure: Tree of Output Classes: Instead of a flat list of each class, it arranges them into a tree. Each leaf node on the tree corresponds to a single output class (e.g., a specific word). Path to Prediction: The probability of any specific word is calculated as the product of probabilities along the path from the tree's root to that word's leaf node. Why It's Faster: Instead of one giant calculation, predicting a word now involves a series of smaller decisions as you traverse down the tree (think a sequence of left or right turns). The computation time becomes proportional to the depth of the tree, which is much less than the total number of classes. Hierarchical softmax offers a significant speed improvement, especially for large vocabularies or classification problems with many possible outcomes. Also, it can more effectively handle infrequent words, as they have a defined path in the tree, unlike in standard softmax where they get overwhelmed by more common words.
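The path-product idea can be shown on a four-word vocabulary arranged as a balanced binary tree. The node scores are hypothetical constants here. In a real model each score would be computed from the hidden vector and a learned vector for that node.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical score at each internal node; sigmoid(score) = P(go left).
NODE_SCORES = {"root": 0.3, "left": -1.2, "right": 0.8}

# Path from the root to each word's leaf as (internal node, direction) pairs.
PATHS = {
    "cat": [("root", "L"), ("left", "L")],
    "dog": [("root", "L"), ("left", "R")],
    "car": [("root", "R"), ("right", "L")],
    "bus": [("root", "R"), ("right", "R")],
}

def word_probability(word):
    """Product of the binary decision probabilities along the word's path."""
    p = 1.0
    for node, direction in PATHS[word]:
        p_left = sigmoid(NODE_SCORES[node])
        p *= p_left if direction == "L" else 1.0 - p_left
    return p

probabilities = {word: word_probability(word) for word in PATHS}
total = sum(probabilities.values())
```

Each word needs only two binary decisions (the tree depth) instead of a normalization over all four words, and the leaf probabilities still sum to 1.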

Answer 53

A Markov chain is a model that describes a sequence of events where the probability of each event depends only on the state of the previous event. It's like having a system with "short-term memory." Imagine a frog hopping between lily pads. Its next hop only depends on which lily pad it's currently on, not where it was before. Markov chains are used to model processes that seem somewhat random but have some underlying patterns based on the current state, such as weather patterns, text generation, or stock market fluctuations. Markov models are all about the current state. The future depends only on the present, not the full history that led to it. This makes them relatively simple, as you don't need to track extensive past information. They are useful for modeling systems where the next state has a clear probabilistic dependence on the current state (weather patterns, board game moves). Bayesian approaches are all about belief updating. They start with a prior belief about something (a hypothesis, a parameter distribution) and continuously update this belief as new evidence (data) comes in. They incorporate existing knowledge or assumptions into the model. They excel when you can factor in your prior beliefs about a problem and want your model to learn incrementally over time (spam filtering, medical diagnosis). Simplified Analogy Markov: Weatherman who only looks at today's conditions for tomorrow's forecast. Bayesian: Weatherman who starts with climate averages, then continuously refines their forecast as each day's data arrives.
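The "only the current state matters" property can be seen by pushing a distribution through a transition table one day at a time. The two-state weather chain and its probabilities are made up.

```python
STATES = ["sunny", "rainy"]

# TRANSITION[current][next] = probability of moving from current to next.
TRANSITION = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(distribution):
    """Advance the state distribution by one time step."""
    return {
        nxt: sum(distribution[cur] * TRANSITION[cur][nxt] for cur in STATES)
        for nxt in STATES
    }

today = {"sunny": 1.0, "rainy": 0.0}  # we know it is sunny today
tomorrow = step(today)
day_after = step(tomorrow)
```

Note that step needs only the current distribution, never the history that produced it.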

Answer 54

Core Layers: Dense (Fully Connected) Layers: The workhorse of many neural networks. Every neuron in a dense layer is connected to every neuron in the previous layer. Used for learning complex relationships between input features and for outputting final predictions. Convolutional Layers: Designed to extract local patterns from data, especially images. They apply small filters that slide over the input, detecting features like edges and textures. Crucial for computer vision tasks. Recurrent Layers (LSTM, GRU): Specialized for handling sequential data like text or time series. They maintain an internal memory to "remember" information from previous elements in the sequence, making them excellent for language modeling and tasks with temporal dependencies. Normalization Layers: Batch Normalization: Helps stabilize and speed up training by normalizing the activations of a layer across a batch of data. Reduces sensitivity to initialization and allows for higher learning rates. Layer Normalization: Similar to batch normalization, but normalizes across the features within a single example, helpful for specific tasks like natural language processing. Activation Layers: ReLU (Rectified Linear Unit): Very popular due to its simplicity and ability to prevent the vanishing gradient problem. It simply outputs the input if it's positive, otherwise outputs zero. Sigmoid: Maps input values between 0 and 1, often used for output layers in binary classification problems (predicting probabilities). Tanh: Similar to Sigmoid, but maps inputs between -1 and 1, sometimes helpful for certain tasks. Pooling Layers: Max Pooling: Downsamples feature maps by taking the maximum value within a sliding window, reducing dimensionality and making the network more robust to small data variations. Average Pooling: Similar to max pooling, but takes the average within the window. 
Other Specialized Layers: Dropout: A regularization technique that randomly drops neurons during training to prevent overfitting. Attention Mechanisms: Used in transformer architectures to allow the model to focus on important parts of the input sequence, crucial for advanced natural language processing tasks. Sparsely Connected Layers:
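Max and average pooling from the list above can be sketched on a small 2-D feature map. Non-overlapping windows (stride equal to the window size) are assumed, and the input values are arbitrary.

```python
def pool_2d(feature_map, size, reduce):
    """Slide a size x size window with stride `size` and reduce each window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row_out = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row_out.append(reduce(window))
        pooled.append(row_out)
    return pooled

def window_mean(window):
    return sum(window) / len(window)

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 1],
]
max_pooled = pool_2d(feature_map, size=2, reduce=max)
avg_pooled = pool_2d(feature_map, size=2, reduce=window_mean)
```

Both variants shrink the 4x4 map to 2x2. Max pooling keeps the strongest response in each window, while average pooling smooths over it.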

(78 cards)