Models & Elements Flashcards

1
Q

Attention Mechanism

A

Enables the model to selectively focus on specific, relevant parts of the input data while making predictions or generating text.

A component commonly used in neural network architectures, particularly in natural language processing (NLP) and computer vision tasks. It enables models to focus on specific parts of input data dynamically, assigning different weights to different parts of the input sequence. This process allows the model to selectively attend to relevant information, enhancing its ability to capture long-range dependencies and handle variable-length input sequences effectively. Here’s how it works:

Encoding Input Features: The input sequence is encoded into a sequence of feature vectors using an encoder network. Each feature vector represents a specific part of the input sequence and contains information about that part’s content.

Calculating Attention Weights: Next, the attention mechanism calculates attention weights for each feature vector in the input sequence. These weights determine the importance or relevance of each part of the input sequence with respect to the current context.

Weighted Sum: The attention weights are applied to the corresponding feature vectors, producing a weighted sum of the input sequence. This weighted sum emphasizes the parts of the input sequence that are deemed most relevant or informative for the current task or context.

Context Vector: The weighted sum of the input sequence, known as the context vector, is then passed to the subsequent layers of the neural network for further processing. This context vector encapsulates the most relevant information from the input sequence, enabling the model to make more informed predictions or decisions.

Training and Learning: During training, the attention mechanism learns to dynamically adjust the attention weights based on the input sequence and the task at hand. This learning process enables the model to adaptively focus on different parts of the input sequence as needed, improving its performance on various tasks such as machine translation, text summarization, and image captioning.
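
The scoring, weighting and summing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming dot-product scoring and a softmax to normalise the weights (the card does not say how the weights are computed); the feature vectors and the query are made-up numbers.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, query):
    """Return (attention weights, context vector) for one query.

    features: array of shape (seq_len, d), one encoded vector per input position.
    query:    array of shape (d,), standing in for the current context.
    """
    scores = features @ query      # relevance of each position to the query
    weights = softmax(scores)      # non-negative weights that sum to 1
    context = weights @ features   # weighted sum of the feature vectors
    return weights, context

# Toy encoded sequence: 3 positions, 2 features each (hypothetical values).
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
query = np.array([2.0, 0.0])

weights, context = attend(features, query)
print(weights)   # positions 0 and 2 score equally and outweigh position 1
print(context)   # the context vector passed on to later layers
```

In a trained model the encoder and the query come from learned parameters, which is how the weights adapt to the task.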

2
Q

Attention RNN

A

A specialized architecture within recurrent neural networks (RNNs) that selectively focuses on different parts of the input sequence during processing. Unlike traditional RNNs, which process sequences with a fixed-size internal state, Attention RNN dynamically adjusts its attention weights, allowing it to give more importance to relevant inputs while suppressing irrelevant ones. It is commonly used in natural language processing tasks such as machine translation and text summarization, where understanding context and relevance within sequences is crucial. For example, in machine translation, Attention RNN enables the model to align words from the source language to the target language more effectively by attending to specific words in the source sentence during translation.

3
Q

Auto-Encoders

A

An autoencoder is a feed-forward neural network with an encoder-decoder architecture. It is trained to reconstruct its input.

Autoencoders are a special type of neural network used in machine learning for unsupervised learning tasks. They are essentially designed to learn efficient representations of data by trying to compress the data and then recreate it.

  1. Structure: An autoencoder consists of two main parts: an encoder and a decoder.
    The encoder takes the input data (like an image or a text snippet) and compresses it into a lower-dimensional representation, often called the “latent space” or “code.” This code captures the essential features of the input data.
    The decoder then receives this compressed code and tries to reconstruct the original input data from it.
  2. Training: During training, the autoencoder is given a set of input data. It then tries to encode this data and decode it back, minimizing the difference between the original data and the reconstructed data. This forces the encoder to learn a good representation of the data in the latent space, since it needs this information to create an accurate reconstruction.
  3. Applications: Autoencoders have various applications because of their ability to learn data representations. Here are a few examples:
    a) Dimensionality reduction: By compressing data into a lower-dimensional latent space, autoencoders can be used to reduce the storage space needed for data or improve the efficiency of other algorithms.
    b) Denoising: Autoencoders can be trained to ignore noise in the data by focusing on reconstructing the underlying clean patterns. This can be useful for tasks like image denoising or filtering audio data.
    c) Anomaly detection: Since autoencoders learn what “normal” data looks like, they can identify deviations from the norm. This can be helpful for anomaly detection in areas like fraud detection or system health monitoring.

By learning compressed representations of data, autoencoders offer a powerful tool for various tasks in machine learning, especially when dealing with unlabeled data.
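
A minimal sketch of the encode, decode and minimise-the-difference loop, assuming the simplest possible case: a linear encoder and decoder (no biases, no activations) trained with plain gradient descent. The data, layer sizes and learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 4-D that really live on a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing

# Encoder and decoder are single linear layers.
W_enc = rng.normal(scale=0.1, size=(4, 2))   # 4-D input -> 2-D code
W_dec = rng.normal(scale=0.1, size=(2, 4))   # 2-D code  -> 4-D reconstruction

def reconstruction_loss(X, W_enc, W_dec):
    # Mean squared difference between the input and its reconstruction.
    code = X @ W_enc
    X_hat = code @ W_dec
    return np.mean((X_hat - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for _ in range(2000):
    code = X @ W_enc                      # encode
    X_hat = code @ W_dec                  # decode
    err = (X_hat - X) / X.shape[0]        # gradient of 0.5 * squared error per sample
    grad_dec = code.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)   # the reconstruction error should shrink
```

Practical autoencoders add non-linear activations and more layers; the training signal is the same reconstruction error.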

4
Q

Bagging (Ensemble models)

A

Bagging involves training multiple instances of the same learning algorithm on different subsets of the training data and then averaging the predictions to make the final prediction.

Process:
Bootstrap Sampling: Randomly sample subsets (with replacement) from the training data.
Model Training: Train a base model (e.g., decision tree) on each bootstrap sample.
Prediction Aggregation: Combine predictions from all base models, often by averaging (for regression) or voting (for classification).
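
The three steps can be sketched as follows. To stay dependency-free, the base model here is a least-squares line rather than the decision tree mentioned above; that substitution, the toy data and the number of models are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 1 plus noise.
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.shape)

def fit_line(x, y):
    # Base model: straight line fitted by least squares; returns (slope, intercept).
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept

def bagged_predict(x_train, y_train, x_new, n_models=25):
    preds = []
    n = len(x_train)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: n indices drawn with replacement
        slope, intercept = fit_line(x_train[idx], y_train[idx])
        preds.append(slope * x_new + intercept)
    return np.mean(preds, axis=0)          # aggregate by averaging (regression)

x_new = np.array([2.0, 5.0])
prediction = bagged_predict(x, y, x_new)
print(prediction)
```

For classification the last line of `bagged_predict` would take a majority vote instead of a mean.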

5
Q

BERT

A

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based language model developed by Google. It leverages bidirectional context and transformer architecture to learn rich contextual representations of words and sentences from large corpora of text data. BERT has achieved state-of-the-art performance on various natural language processing (NLP) tasks, including question answering, sentiment analysis, named entity recognition, and language understanding tasks.

6
Q

Bi-directional RNN

A

A type of recurrent neural network architecture that processes input sequences in both forward and backward directions. By utilizing information from past and future timesteps simultaneously, bi-directional RNNs can capture contextual dependencies more effectively than traditional RNNs, making them well-suited for tasks such as sequence labeling, machine translation, and sentiment analysis.

7
Q

Bias direction (NLM)

A

Bias direction in the context of word embeddings refers to the vector direction between the position of a word in the embedding space and the position where it ideally should be. This direction signifies the deviation of the word’s representation in the embedding space from a bias-free position. Identifying bias direction involves analyzing the displacement of word vectors in relation to desired unbiased representations. Techniques for addressing bias direction include debiasing algorithms, which aim to adjust word embeddings to minimize biases while preserving semantic information.

It can also be described as a specific direction or subspace within the embedding space where certain biases are encoded. Word embeddings, created through techniques like Word2Vec or GloVe, map words to high-dimensional vectors in a continuous vector space. Bias direction arises when these embeddings exhibit systematic biases toward certain concepts, genders, races, or other social categories present in the training data. For example, if certain occupations are predominantly associated with one gender in the training data, the embedding space might reflect this bias by positioning words related to those occupations closer to gender-specific words. Identifying bias direction involves analyzing the spatial relationships between word vectors and understanding which directions in the embedding space correspond to biased attributes.
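
A minimal sketch of one way to estimate a bias direction and remove it from a word vector, assuming a projection-based "neutralise" step (one of several debiasing approaches, not necessarily the one the card has in mind). The 3-D vectors are invented stand-ins for real embeddings.

```python
import numpy as np

# Made-up 3-D vectors standing in for real word embeddings.
he = np.array([1.0, 0.2, 0.1])
she = np.array([-1.0, 0.2, 0.1])
engineer = np.array([0.4, 0.9, 0.3])

# Bias direction: the normalised difference between a gender-specific pair.
direction = he - she
direction = direction / np.linalg.norm(direction)

# How strongly "engineer" leans along that direction.
bias_component = engineer @ direction

# Neutralise: subtract the component that lies along the bias direction.
debiased = engineer - bias_component * direction

print(bias_component)         # non-zero: the word leans towards one end of the pair
print(debiased @ direction)   # ~0 after neutralising
```

Real methods usually estimate the direction from many word pairs rather than one.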

8
Q

Bias in word embeddings

A

Tendency of word embedding models to encode and reinforce societal biases present in the training data. Word embeddings are vector representations of words learned from large text corpora using techniques such as Word2Vec or GloVe. However, these embeddings may inadvertently capture stereotypes, prejudices, or cultural biases present in the text data, leading to biased representations of certain concepts or groups.

Depending on the attributes with which each word is associated, a word can be represented as a point in a multidimensional space. The distance between a word and certain features can be measured as a vector. This vector has a magnitude and a direction that tell us the strength and type of the relation between two words. With this information we can determine similarities in meaning between words, or biases.

9
Q

Biases (NN)

A

A very simple way to understand it:
Each neuron has a function that takes inputs and multiplies each one by a specific weight. The bias is an added value that tells us how much we want to “favor” the function created by the neuron. The higher the bias, the more likely it is that the neuron will be activated.

Each neuron in a neural network has its own bias value, independent of the weights. Bias is essentially an intercept term similar to the intercept in linear regression. While weights determine the strength of the connections between neurons, biases allow neurons to adjust the output independently of the input. The bias term acts as an adjustable parameter that allows the neuron to output non-zero values even when all inputs are zero. This means the neuron can still fire and contribute to the network’s output even if the weighted sum of inputs is low or zero. Without a bias, the model’s output might always be confined to a certain range. Think of neurons as making decisions by ‘activating’ when the weighted sum plus bias exceeds a threshold. The bias can make it easier or harder for a neuron to activate, controlling its baseline behavior.

In a neuron, biases are added to the weighted sum of inputs before applying the activation function:
weighted_sum = (w1 * x1) + (w2 * x2) + (w3 * x3) + b

While weights control the slope of the relationship between a neuron’s input and output, the bias determines the intercept, or where the line crosses the y-axis. In other words: biases shift the activation function left or right along the input axis. This is critical for increasing the flexibility of where the activation function can trigger. Without a bias, the model might be forced to pass a decision boundary through the origin. Biases allow the boundary to shift, enabling the model to fit different patterns. Think of predicting temperatures: even if all your input values are zero, a bias can allow you to model non-zero temperatures.
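
A small sketch of the effect of the bias when every input is zero, assuming a sigmoid activation and made-up weights. With all-zero inputs the weights cannot influence the result, so only the bias moves the output.

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum plus bias, followed by a sigmoid activation.
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x_zero = np.zeros(3)
w = np.array([0.5, -1.0, 2.0])   # hypothetical weights

out_no_bias = neuron(x_zero, w, b=0.0)    # sigmoid(0) = 0.5 whatever the weights are
out_pos_bias = neuron(x_zero, w, b=2.0)   # positive bias pushes towards activation
out_neg_bias = neuron(x_zero, w, b=-2.0)  # negative bias pushes away from activation

print(out_no_bias, out_pos_bias, out_neg_bias)
```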

Biases are also learned parameters. They are initialized (usually to zero or small values) and then updated during training through backpropagation, just like weights. A sensible starting bias can sometimes aid faster convergence, because the model starts closer to a good solution. Common choices are zeros or small constant or random values; schemes like Xavier or He initialization are normally applied to the weights rather than the biases.

10
Q

Boosting (Ensemble models)

A

Boosting sequentially trains a series of weak learners (e.g., shallow decision trees) and adjusts the weights of data points to emphasize the mistakes made by previous models.

Process:
Iterative Training: Train a series of weak models, each focusing on the mistakes of the previous ones.
Weight Adjustment: Assign higher weights to misclassified data points to make them more influential in subsequent iterations.
Model Combination: Combine predictions from all weak models, often by weighted averaging.
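
The weight-adjustment step can be sketched with the AdaBoost update rule, which is one specific boosting algorithm (the card describes boosting in general, so treating it as AdaBoost is an assumption). Labels are taken to be -1 or +1, and the predictions below are invented.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}.

    Returns the learner's vote strength (alpha) and the updated sample weights.
    """
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)        # weighted error rate
    err = np.clip(err, 1e-10, 1 - 1e-10)                 # avoid log(0) and division by zero
    alpha = 0.5 * np.log((1 - err) / err)                # more accurate learner -> larger vote
    new_w = weights * np.exp(-alpha * y_true * y_pred)   # grow weights of mistakes, shrink the rest
    return alpha, new_w / new_w.sum()                    # renormalise so the weights sum to 1

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the weak learner gets sample 1 wrong
weights = np.full(5, 0.2)               # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha)         # positive, since the learner beats chance
print(new_weights)   # the misclassified sample now carries the largest weight
```

The final ensemble prediction is the sign of the alpha-weighted sum of the weak learners' outputs.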

11
Q

Cascade classifier

A

A machine learning model used for object detection in images or video streams. It consists of a sequence of stages, each containing a classifier trained to detect specific features or patterns of interest (e.g., Haar-like features in Viola-Jones algorithm). Cascade classifiers are designed to efficiently reject negative samples at early stages and focus computation on regions likely to contain objects, enabling real-time performance in applications such as face detection and pedestrian detection.
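
The early-rejection idea can be sketched as a loop that stops at the first failing stage. The two stage functions and their thresholds are hypothetical stand-ins for trained stage classifiers, not real Haar-feature detectors.

```python
def cascade_predict(window, stages):
    # stages: list of (score_fn, threshold), ordered from cheapest to most expensive.
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False   # rejected early; later stages never run
    return True            # survived every stage

def mean_brightness(window):
    return sum(window) / len(window)

def contrast(window):
    return max(window) - min(window)

# Hypothetical stages operating on a flat list of pixel intensities in [0, 1].
stages = [(mean_brightness, 0.2), (contrast, 0.5)]

dark_patch = cascade_predict([0.0, 0.1, 0.1], stages)   # fails stage 1
flat_patch = cascade_predict([0.5, 0.5, 0.5], stages)   # passes stage 1, fails stage 2
candidate = cascade_predict([0.1, 0.9, 0.8], stages)    # passes both stages
print(dark_patch, flat_patch, candidate)
```

Because most image windows are rejected by the cheap first stages, the expensive later stages run on only a small fraction of them.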

12
Q

Cell state

A

In Long Short-Term Memory (LSTM) networks, the cell state is a vector of values carried from one time step to the next inside the LSTM cell. It is not a set of learned parameters; the gates decide at each step what is written to it, kept in it, or erased from it, and therefore what information is retained for future time steps.

The cell state serves as a long-term memory store that enables the LSTM network to capture dependencies and patterns over extended sequences of data. This mechanism allows the network to retain relevant information and discard irrelevant information as it processes sequential data, helping it to make more accurate predictions or classifications.

13
Q

Conditional Random Fields (CRF)

A

A type of probabilistic graphical model used for modeling structured prediction tasks in machine learning and natural language processing. CRFs model the conditional probability distribution of output variables given input features and capture dependencies among neighboring variables in structured data, such as sequences, graphs, or grids. They are widely used for tasks like sequence labeling, named entity recognition, part-of-speech tagging, and information extraction.

14
Q

Cost-sensitive learning

A

A machine learning paradigm that incorporates the differential costs of errors or misclassifications into the training process. In cost-sensitive learning, the objective is to minimize a loss function that considers the varying costs associated with different types of errors or outcomes. Cost-sensitive learning is particularly relevant in imbalanced classification problems, where the classes have unequal costs or misclassification penalties. It enables models to prioritize accurate predictions for minority classes or critical outcomes, enhancing their performance and applicability in real-world scenarios.

It is a field of study that is closely related to the field of imbalanced learning that is concerned with classification on datasets with a skewed class distribution.
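
One simple way to act on unequal error costs is to predict the class with the lowest expected cost instead of the most probable class. The sketch below assumes a made-up cost matrix in which missing a positive is ten times worse than a false alarm; the probabilities are invented model outputs.

```python
import numpy as np

# Hypothetical cost matrix: cost[true_class, predicted_class].
cost = np.array([[0.0, 1.0],     # true class 0: a false alarm costs 1
                 [10.0, 0.0]])   # true class 1: a miss costs 10

def min_cost_prediction(probs, cost):
    # probs[i, k] is the model's probability that sample i belongs to class k.
    expected_cost = probs @ cost   # expected cost of predicting each class
    return expected_cost.argmin(axis=1)

probs = np.array([[0.95, 0.05],
                  [0.80, 0.20],
                  [0.30, 0.70]])

plain = probs.argmax(axis=1)                    # ignores costs
cost_aware = min_cost_prediction(probs, cost)   # accounts for costs
print(plain, cost_aware)
```

The second sample flips to the costly class even at 20% probability, because a miss there is so expensive. The same cost information can instead be fed into training as per-class or per-sample loss weights.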

15
Q

Definitional word (NLP)

A

Understanding the relationship between definitional words of concepts helps structure knowledge bases and ontologies. The words used in a definition can help determine the correct meaning of a word in a specific context. Automatically identifying definitional words can help extract definitions of terms from large amounts of text.

Definitional word refers to the core words or phrases that explain the essence of a concept and can be used to create its definition.
Example: If the concept is “dog”, definitional words might include “mammal”, “pet”, “bark”, “loyal”.

This focuses on the kinds of words typically found in formal definitions:
Genus: The broader category the term belongs to (“dog” is a type of “animal”).
Differentia: What distinguishes it from other members of the category (dogs “bark”, have “fur”).

16
Q

Dense layer

A

A dense layer, also known as a fully connected layer, is a type of layer where each neuron or node is connected to every neuron in the previous layer. Dense layers play a fundamental role in feedforward neural networks, where they perform linear transformations and apply activation functions to input data. They enable neural networks to learn complex mappings between input and output data by capturing non-linear relationships and hierarchical features. Dense layers are commonly used in deep learning architectures for tasks such as classification, regression, and function approximation.
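
A minimal NumPy sketch of a dense layer as a matrix multiplication plus a bias, followed by an activation. The layer sizes and the choice of tanh are arbitrary assumptions.

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    # Every output unit sees every input: one weight per (input, output) pair.
    return activation(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # batch of 4 samples with 3 features each
W = rng.normal(size=(3, 5))   # 3 inputs fully connected to 5 units
b = np.zeros(5)               # one bias per unit

out = dense(x, W, b)
print(out.shape)   # one row per sample, one column per unit
```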

17
Q

Feeding-forward or forward pass (in NN)

A

Process of passing input data through the network’s layers in a forward direction, from the input layer through the hidden layers to the output layer. During this process, each layer performs a series of computations, such as linear transformations and activation functions, to generate predictions or representations of the input data. Feeding-forward is the basic operation performed during both training and inference in neural networks.

Taking an input and passing it through the network’s layers from input to output.
Performing the calculations at each neuron (weighted sums, activation functions).
Producing the final output (prediction, classification, etc.).
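
These three steps can be sketched as a loop over layers. The weights below are hand-picked so the result is easy to check by hand; a real network would learn them.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, layers):
    # layers is a list of (W, b, activation) tuples, applied from input to output.
    a = x
    for W, b, activation in layers:
        a = activation(a @ W + b)   # weighted sum, bias, then activation
    return a

layers = [
    (np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([0.0, -1.0]), relu),   # hidden layer
    (np.array([[1.0], [1.0]]), np.array([0.5]), identity),                # output layer
]

y = forward(np.array([[1.0, 2.0]]), layers)
print(y)   # hidden activations are [2, 2], so the output is 2 + 2 + 0.5 = 4.5
```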

18
Q

Gated recurrent unit

A

A gated recurrent unit (GRU) is a recurrent cell designed to capture long-range dependencies and mitigate the vanishing gradient problem. Gated units incorporate gating mechanisms that control the flow of information within the network, allowing it to selectively retain or discard information at each time step. A GRU itself uses two gates, an update gate and a reset gate, and has no separate cell state. The three gates below are those of the closely related LSTM unit:
Input Gate (i): Controls new input information’s influence on the cell state. It decides which parts of the current input should be used to update the cell state.
Forget Gate (f): Decides which previous cell state information to keep or discard. Controls whether information from the previous time step should be forgotten or retained in the cell state.
Output Gate (o): Determines which cell state information should be output or used as the current output.

19
Q

Gates

A

In neural network architectures like LSTMs and GRUs, gates are specialized components that regulate the flow of information within the network by controlling how much information is passed along and retained at each time step. The most common types of gates include:

Input Gates: Input gates determine how much new information is added to the memory cell at each time step. They regulate the update of the memory cell state based on the current input and the previous hidden state.
Forget Gates: Forget gates control how much information from the previous memory cell state is retained or forgotten at each time step. They decide which information is relevant to retain and which can be discarded.
Output Gates: Output gates determine how much information from the current memory cell state is passed to the output at each time step. They control the information flow from the memory cell to the output of the network.

The gates output values between 0 and 1 due to the sigmoid function. A value of 0 means “block this information” and a value of 1 means “let this information pass through”. The gates perform element-wise multiplication with both the previous cell state and new information, allowing fine-grained control over what’s kept and what’s added.
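
A short sketch of the element-wise gating described above. The gate pre-activations would normally come from learned weights; here they are made-up values chosen to show "keep", "block" and "half" behaviour.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

prev_cell = np.array([2.0, -1.0, 0.5])   # previous cell state
candidate = np.array([0.3, 0.8, -0.6])   # new information proposed at this step

# Hypothetical pre-activations: large positive -> ~1, large negative -> ~0, zero -> 0.5.
forget_gate = sigmoid(np.array([10.0, -10.0, 0.0]))
input_gate = sigmoid(np.array([-10.0, 10.0, 0.0]))

# Element-wise multiplication decides, per component, what is kept and what is added.
new_cell = forget_gate * prev_cell + input_gate * candidate
print(new_cell)   # roughly [2.0, 0.8, -0.05]
```

The first component keeps its old value, the second is overwritten by the new candidate, and the third is an even blend of both.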

Gates play a crucial role in addressing the challenges of capturing long-term dependencies and mitigating the vanishing gradient problem in recurrent neural networks, enabling them to effectively model sequential data and time series.

20
Q

Generative Adversarial Networks (GANs)

A

GANs are used in unsupervised machine learning (they learn directly from the structure of real data without requiring explicit labels), particularly for generating synthetic data samples that resemble real data. GANs consist of two neural networks, a generator and a discriminator, which are trained simultaneously in a competitive setting. The generator learns to produce realistic-looking data samples by starting from random noise and gradually refining it to resemble real data. The discriminator learns to distinguish between real data samples and fake ones produced by the generator, improving its ability to spot the differences over time. Through adversarial training, GANs learn to generate high-quality, diverse data samples across various domains, such as images, text, and music, with applications in image synthesis, data augmentation, and creative AI.

Discriminator Training: The discriminator is exposed to both real data samples and fake samples generated by the generator. It learns to classify them as “real” or “fake” based on their features.
Generator Training: The generator attempts to fool the discriminator by generating increasingly realistic samples. It receives feedback from the discriminator, helping it improve its output quality.

21
Q

GloVe (Global Vectors)

A

GloVe (Global Vectors for Word Representation) is a word embedding technique used to represent words as dense vectors in a continuous vector space. GloVe learns word embeddings by analyzing the global co-occurrence statistics of words in large text corpora. It captures semantic relationships between words by considering their contextual usage patterns across the entire corpus. GloVe embeddings encode semantic similarities and syntactic relationships between words, making them useful for natural language processing tasks such as word similarity calculation, document classification, and sentiment analysis. GloVe embeddings are pre-trained on large corpora and are widely used in various machine learning applications.

The key principle is capturing the global co-occurrence statistics of words within a large text corpus. Here’s how it works in steps:

  1. Co-occurrence Matrix: GloVe starts by constructing a matrix where each row represents a word and each column represents another word. The value in each cell indicates how often the words appear together within a certain window size in the corpus.
  2. Focusing on Ratios: Instead of focusing solely on raw occurrence counts, GloVe considers the ratios of how often words co-occur with each other. This emphasizes the meaningful relationships between words beyond just their frequency.
  3. Loss Function: GloVe defines a weighted least-squares loss that aims to minimize the difference between the dot product of two word vectors (plus a bias term for each word) and the logarithm of their co-occurrence count in the corpus. Training the model tries to find vector representations that respect these global relationships observed in the data.

GloVe aims to create a vector space where the distances and angles between word vectors reflect their semantic relationships (similar words are closer, related words have meaningful angles between them). If the words “ice” and “steam” frequently co-occur with the word “water”, their respective vectors will end up being quite similar after GloVe training.
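
A small sketch of the first and third steps: counting co-occurrences within a window, and the loss term for a single word pair. The counting here is unweighted, which is a simplification, and the toy sentence is invented. The `x_max` and `alpha` defaults follow the values commonly quoted for GloVe, but treat them as assumptions.

```python
import math
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    # Count how often each ordered pair of words appears within `window` positions.
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1.0
    return counts

def pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    # Weighted squared gap between (dot product + biases) and log co-occurrence count.
    weight = min(1.0, (x_ij / x_max) ** alpha)   # down-weights rare pairs
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return weight * (dot + b_i + b_j - math.log(x_ij)) ** 2

tokens = "ice is cold water and steam is hot water".split()
counts = cooccurrence_counts(tokens, window=2)

print(counts[("is", "water")])     # "is" appears near "water" twice
print(counts[("cold", "water")])   # once
print(pair_loss([0.1, 0.2], [0.3, -0.1], 0.0, 0.0, counts[("is", "water")]))
```

Training sums `pair_loss` over all pairs with a non-zero count and adjusts the vectors and biases by gradient descent.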

22
Q

Hidden layer

A

Hidden layers are the layers within a neural network that sit between the input layer (where your data enters) and the output layer (where the predictions or results are produced). What makes them “hidden” is that they have no direct contact with the outside: they don’t directly receive data from, or send outputs to, anything outside the network. Hidden layers are where the network discovers intricate patterns and relationships within the data. What occurs in these layers is often hard to interpret, hence the “hidden” aspect.

Like other layers, hidden layers consist of artificial neurons. These neurons perform calculations:
- Take in data (either from the input layer or a previous hidden layer).
- Apply weights, biases, and an activation function.
- Pass on the transformed data to the next layer.

Each hidden layer builds upon the work of the previous one. This hierarchical structure allows the network to learn increasingly complex representations of the data. Hidden layers help break a problem into smaller, more manageable sub-problems that the network can solve incrementally. Activation functions in hidden layers introduce non-linearity, which is crucial for neural networks to model more than just simple linear relationships in data.

23
Q

Hidden state 𝑎

A

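The state (the information gathered so far) of the neurons or units in the hidden layers of the network. In a recurrent network, it is the vector of activations passed from one time step to the next.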

24
Q

Hyperbolic Tangent function (TanH)

A

A mathematical function that smoothly squashes input values into the range -1 to 1. It has the same S-shape as the sigmoid function, but is centered on zero and steeper around the origin. TanH is a common activation function, often used in hidden layers. It is sometimes preferred in RNNs because it tends to suffer a bit less from vanishing gradients than sigmoid.

Why It’s Used in Neural Networks:
Zero-Centered Outputs: Unlike the sigmoid function (0 to 1 range), tanh conveniently outputs both positive and negative values. This can be useful for certain layers or problems. Some problems require the model to signal both increase and decrease relative to something. For example, predicting stock price movement (up or down) would benefit from negative outputs.
Stronger Gradients: Especially near zero, TanH tends to have larger gradients than the sigmoid function. This can sometimes lead to faster training convergence.
Mitigating Vanishing Gradients: While not totally immune, TanH can be a bit better at tackling vanishing gradient issues in some cases compared to sigmoid.
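
The comparison with sigmoid can be checked numerically. The sketch below evaluates both functions and their derivatives on a few sample points; the grid of x values is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)   # includes x = 0

tanh_out = np.tanh(x)           # range (-1, 1), zero-centered
sigmoid_out = sigmoid(x)        # range (0, 1), centered on 0.5

tanh_grad = 1.0 - np.tanh(x) ** 2                  # derivative of tanh, peaks at 1.0 for x = 0
sigmoid_grad = sigmoid_out * (1.0 - sigmoid_out)   # derivative of sigmoid, peaks at 0.25

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
identity_holds = np.allclose(tanh_out, 2.0 * sigmoid(2.0 * x) - 1.0)

print(tanh_grad.max(), sigmoid_grad.max(), identity_holds)
```

The larger peak gradient is what the "Stronger Gradients" point refers to. Both functions still saturate for large inputs, so neither is immune to vanishing gradients.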

25
Q

K-Nearest Neighbours

A

A simple and intuitive machine learning algorithm used for classification and regression tasks. Given a new data point, KNN predicts its class label or numerical value based on the majority vote or average of its k nearest neighbors in the training dataset. KNN relies on the assumption that similar data points tend to belong to the same class or have similar target values. It is a non-parametric and lazy learning algorithm that does not require training a model explicitly.
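
A minimal KNN classifier, assuming Euclidean distance and a simple majority vote (the card does not fix the distance metric). The training points are made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # "Training" is just storing the data; all the work happens at prediction time.
    distances = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                     # majority class among the neighbours

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

label_near_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
label_near_b = knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3)
print(label_near_a, label_near_b)
```

For regression, the vote would be replaced by the mean of the neighbours' target values.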

26
Q

Linear regression (model)

A

A supervised learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the target variable.

Linear regression searches for the line (or hyperplane in higher dimensions) that minimizes the sum of the squared vertical distances (residuals) between the observed data points and the values predicted by the line. This method is known as the method of least squares.
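
A least-squares fit can be sketched with NumPy's `np.linalg.lstsq`. The data below is generated exactly from y = 3x + 2 with no noise, an assumption chosen so the recovered coefficients are easy to verify.

```python
import numpy as np

# Toy data generated from y = 3x + 2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix with a column of ones so the model can learn an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least squares: find the coefficients that minimise the sum of squared residuals.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coeffs

print(slope, intercept)   # close to 3 and 2
```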

Linear regression models are commonly used for prediction and inference tasks, and they provide interpretable coefficients that indicate the strength and direction of the relationships between variables.

27
Q

Long Short-Term Memory (LSTM) unit

A

Traditional neural networks struggle to handle data with a sequential nature (e.g., text, time series). RNNs address this by having a “memory” mechanism to retain information from previous steps. But long sequences make it hard for RNNs to learn long-range dependencies. Information from earlier steps can fade away as it’s propagated.

LSTMs are an advanced type of RNN cell designed to overcome the vanishing gradient problem. They do not just remember the last state; they can also decide which parts are worth remembering.

Key Components of an LSTM Unit:
1. Cell State: This is the “long-term memory” of the LSTM. It runs through the entire chain, with only minor interactions, keeping information flowing.
2. Gates: These are what make LSTMs special:
Forget Gate: Selectively decides what information from the previous cell state should be discarded.
Input Gate: Determines what new information from the current input should be added to the cell state.
Output Gate: Controls which parts of the updated cell state become part of the output.

How it Works (Simplified)
a) The forget gate looks at the previous hidden state and current input and decides what old information to keep.
b) The input gate processes the current input and creates a “candidate” for updating the cell state.
c) The cell state is updated by combining parts of the old state (what the forget gate didn’t discard) and the new candidate values.
d) The output gate selects relevant parts of the cell state to generate an output.
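
Steps a) to d) can be sketched as a single time step in NumPy, assuming the standard LSTM equations. The weights are random rather than trained, and stacking the four parameter blocks into one matrix `W` is a layout choice made here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,);
    the four blocks hold the forget, input, candidate and output parameters.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # a) forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])   # b) input gate
    g = np.tanh(z[2 * hidden:3 * hidden])   # b) candidate values
    o = sigmoid(z[3 * hidden:4 * hidden])   # d) output gate
    c = f * c_prev + i * g                  # c) updated cell state
    h = o * np.tanh(c)                      # d) new hidden state / output
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # run over a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)

print(h, c)
```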

At their core, LSTMs are neural network layers with a complex internal structure. This includes the cell state and the three gates (forget, input, and output). The gates contain sigmoid and hyperbolic tangent activation functions. Each gate and the calculations for updating the cell state involve sets of weights and biases. These are just like the weights and biases found in other parts of a neural network. LSTMs are trained as part of the overall neural network using the same principles of gradient descent and backpropagation.

28
Q

Naive Bayes Classifier

A

A probabilistic machine learning model based on Bayes’ theorem and the assumption of conditional independence between features. It calculates the probability of each class label given a set of input features and selects the class label with the highest probability as the predicted label for the input. Despite its simplicity and the naive assumption of feature independence, Naive Bayes classifiers are widely used for text classification, spam filtering, and other classification tasks.
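
A compact sketch of a Naive Bayes text classifier, assuming word-count features with add-one (Laplace) smoothing; the four training "documents" are invented. Log-probabilities are summed instead of multiplying raw probabilities, to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (list_of_words, label) pairs.
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(words, class_counts, word_counts, vocab):
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(word | class), treating words as independent.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("win cash prize now".split(), "spam"),
    ("cheap prize win".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
model = train_nb(docs)

spam_guess = predict_nb("win a prize".split(), *model)
ham_guess = predict_nb("monday lunch".split(), *model)
print(spam_guess, ham_guess)
```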

29
Q

Named Entity Recognition (NER)

A

NER is a subfield of Natural Language Processing (NLP) focused on automatically identifying and classifying specific entities within a body of text. These are predefined categories like: People, Organizations (e.g., Google), Locations (e.g., France), Dates & Times (e.g., July 4th, 2023), Quantities (e.g., $1 Million), … and even custom entity types for your specific application.

Various ML algorithms are used for NER, including traditional ML methods like Conditional Random Fields (CRFs) and Support Vector Machines (SVMs), and neural models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers. The NER model predicts the type of entity each word or group of words represents (or indicates that it’s not an entity).

Why NER is important:
* NER extracts structured data from unstructured text, which unlocks many applications. It helps tasks like machine translation, question answering, and text summarization.
* Business applications include customer support chatbots that identify the key issues and people mentioned, analysis of legal documents to extract contract terms, and monitoring of news feeds for relevant company or market trends.

Challenges
* Ambiguity: Words can belong to different categories depending on context (e.g., ‘Apple’ could be a company or a fruit).
* New Entities: Models need to be adaptable to handle previously unseen entities.

Typically, the output of an NER system might look like this:
Original Text: “John Doe visited Paris on July 4th, 2023 and met with the CEO of Acme Inc.”
NER Output:
* John Doe (Person)
* Paris (Location)
* July 4th, 2023 (Date)
* Acme Inc. (Organization)

30
Q

Neuron (NN)

A

A neuron is the most basic processing unit within an artificial neural network. The concept of artificial neurons in neural networks is loosely inspired by biological neurons in the brain. Biological neurons receive signals (inputs) through connections called dendrites, process them, and send an output signal through the axon if a certain threshold is met.

Neural networks learn by adjusting the weights and biases during training. The goal is to find the optimal values that produce the desired output given a specific input. Artificial neural networks are organized into layers: an input layer, one or more hidden layers, and an output layer. Neurons in one layer are connected to neurons in the next, creating a complex network of calculations.

In a neural network, a neuron is a mathematical function that performs the following:
1) Inputs: A neuron receives multiple input values. These inputs could come from raw data (e.g., pixel values of an image) or be the outputs of neurons from a previous layer in the neural network.
2) Weights: Each input is multiplied by a corresponding weight. Weights are like knobs that determine how much influence each input has on the neuron’s output.
3) Summation: The weighted inputs are summed together.
4) Bias: A bias term is added to the sum. The bias is an adjustment that shifts the sum up or down, so the neuron can learn how easily it should activate.
5) Activation Function: The result of the summation (and bias) is passed through a non-linear activation function. This function introduces non-linearity into the model, which is essential for neural networks to learn complex patterns. Common activation functions include:
- Sigmoid
- Tanh
- ReLU (Rectified Linear Unit)
6) Output: The output of the activation function is the final output of the neuron. This output can then be sent to neurons in the next layer of the neural network.
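
The six steps above can be sketched in a few lines. This is only an illustration: the input values, weights, and bias are made-up numbers, and `neuron`, `sigmoid`, and `relu` are names chosen for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Steps 1-6: weight each input, sum, add the bias, apply the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical values chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

out_sigmoid = neuron(inputs, weights, bias, sigmoid)  # weighted sum is about -0.4
out_relu = neuron(inputs, weights, bias, relu)        # negative sum, so ReLU gives 0.0
```

The same weighted sum gives different outputs depending on the activation function, which is why the choice of activation matters.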

Simple Analogy
Imagine a neuron like a decision-maker. Consider the decision of whether to wear a coat outside:

1) Inputs: Temperature, wind speed, likelihood of rain.
2) Weights: How heavily you weigh each factor (you might care more about temperature than wind, etc.)
3) Bias: Your general predisposition towards wearing a coat (some people are more likely to get cold).
4) Activation Function: Your mental model deciding if the combined factors cross a threshold for putting on a coat.
5) Output: The decision – coat or no coat.

31
Q

Neutralization (bias)

A

Sometimes called debiasing. Real-world data often contains biases reflecting social prejudices or historical patterns of discrimination.
ML models trained on this biased data learn and perpetuate these biases, resulting in unfair or harmful predictions. Neutralization is a collection of techniques aimed at reducing the influence of these unwanted biases in ML models.

Approaches to Neutralization
Pre-processing: Modifying the training data to be more balanced or remove sensitive attributes.
In-processing: Changing the model’s training process:
Regularization terms to penalize reliance on biased features.
Adversarial learning setups where a part of the model tries to identify biases to help another part counteract them.
Post-processing: Adjusting model outputs to ensure fairness according to specific metrics.

Most importantly in NLP:
Word embeddings, the vector representations of words that are foundational for many NLP tasks, can capture societal biases. For example, “doctor” might be closer to “man” and “nurse” closer to “woman” in the embedding space.

Debiasing Techniques in NLP
Data Pre-processing
Balanced Corpora: Curating datasets that have more balanced representation of different groups or perspectives.
Data Augmentation: Generating synthetic examples to counterbalance underrepresented groups or viewpoints.

Embedding Debiasing
Geometric Techniques: Realigning word embeddings in the vector space to mitigate biased associations.
Contextualized Embeddings: Instead of static word vectors, using models like BERT that dynamically generate embeddings based on the surrounding sentence, reducing some forms of bias.
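
One geometric technique can be sketched as a projection: estimate a bias direction (here the difference between two word vectors) and remove that component from a word that should be neutral. The 3-dimensional vectors below are invented toy values, not real embeddings, and `neutralize` is a name chosen for this sketch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neutralize(word_vec, bias_direction):
    """Remove the component of word_vec that lies along bias_direction."""
    scale = dot(word_vec, bias_direction) / dot(bias_direction, bias_direction)
    return [w - scale * g for w, g in zip(word_vec, bias_direction)]

# Toy vectors, invented for illustration only.
man = [1.0, 0.2, 0.0]
woman = [-1.0, 0.2, 0.0]
doctor = [0.4, 0.9, 0.3]

gender_direction = [m - w for m, w in zip(man, woman)]
doctor_neutral = neutralize(doctor, gender_direction)
# doctor_neutral now has no component along gender_direction,
# while its other coordinates are unchanged.
```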

Model-level Adjustments
Adversarial Training: Using a setup where one part of the model tries to predict a protected attribute (like gender) from the text, and the other part tries to perform the main task without relying on that protected attribute.
Fairness-aware Regularization: Adding terms to the loss function that penalize biased predictions across groups.

32
Q

One versus the rest

A

Used when you have a multiclass problem but only a binary classification algorithm. We create one classifier per class. In each, we keep the label of the target class and turn all other classes to zero, so we build a model for each class (1, 2, 3, …) being the target class. We make the final prediction by putting the data through each model. With three classes we get 3 probabilities, and the highest wins.
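
A minimal sketch of the idea, assuming a tiny hand-written logistic regression as the binary algorithm (any binary classifier that outputs a score would do). The data points and all function names are made up for illustration.

```python
import math

def predict_proba(model, x):
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train_binary(points, labels, epochs=5000, lr=0.1):
    """Tiny logistic regression trained with batch gradient descent."""
    w, b, n = [0.0] * len(points[0]), 0.0, len(points)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(points, labels):
            error = predict_proba((w, b), x) - y
            grad_w = [g + error * xj for g, xj in zip(grad_w, x)]
            grad_b += error
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_rest(points, labels):
    models = {}
    for target in sorted(set(labels)):
        # Keep the target class as 1 and turn every other class into 0.
        binary_labels = [1 if y == target else 0 for y in labels]
        models[target] = train_binary(points, binary_labels)
    return models

def predict(models, x):
    # One probability per class; the highest wins.
    scores = {cls: predict_proba(m, x) for cls, m in models.items()}
    return max(scores, key=scores.get)

# Three well-separated toy clusters (invented data).
points = [(0, 0), (0, 1), (1, 0), (4, 0), (5, 0), (4, 1), (0, 4), (0, 5), (1, 4)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_rest(points, labels)
predictions = [predict(models, p) for p in [(0.5, 0.5), (4.5, 0.5), (0.5, 4.5)]]
```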

33
Q

One-Class Classification

A

One-class classification (also called unary classification or class modeling) tries to identify objects of one specific class among all other objects, by learning from a training set that contains mostly or only objects of that class. These classifiers are used for outlier detection, anomaly detection and novelty detection. Common examples are: one-class Gaussian, one-class k-means, one-class kNN, one-class SVM.

34
Q

Pre-trained word embeddings

A

Pre-trained word embeddings are a foundational concept in Natural Language Processing (NLP). In essence, they map words and phrases from a human vocabulary to numerical vectors. These vectors aren’t just random but are learned on massive text datasets in a way that captures semantic and syntactic relationships between words. Imagine words plotted in a high-dimensional space – pre-trained embeddings ensure that words like “cat” and “dog” are closer together than “cat” and “airplane”. This allows NLP models to understand the context and nuances of language. Pre-trained models like Word2Vec or GloVe are a form of transfer learning, where knowledge learned from a massive text corpus can be leveraged to boost the performance of new NLP tasks, even on smaller datasets.

35
Q

Pooling layer

A

Pooling layers are like mini summarizers that shrink the size of data while keeping important features. Imagine looking at an image and identifying the overall shapes and edges, rather than getting hung up on every tiny detail. Pooling works similarly, by applying a filter (often a 2x2 square) that slides across the data, summarizing the information within each window. There are different pooling operations, like averaging or taking the maximum value, to capture the most important essence of that local area. This reduction in size makes the data easier to manage for the network, reduces the number of calculations needed, and helps the network focus on broader patterns instead of getting bogged down in precise details that might not be critical for the task at hand.

Imagine the feature map as a grid, and the pooling layer has a small sliding window (e.g., 2x2). As this window moves across the grid, it applies a specific operation like max pooling (taking the highest value within the window) or average pooling (calculating the mean). This process distills the most significant information from each region, decreasing the data size without losing crucial patterns. By doing this, pooling layers make the network computationally lighter, decrease the chance of overfitting (getting too fixated on minor details), and help the CNN focus on larger, more relevant features for its image recognition or classification tasks.
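
The sliding window can be sketched in plain Python. The 4x4 feature map is made up, and `pool2d` is a name chosen for this sketch; real frameworks provide their own pooling layers.

```python
def pool2d(feature_map, size=2, stride=2, op=max):
    """Slide a size x size window over the map and summarize each window with op."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(op(window))
        pooled.append(pooled_row)
    return pooled

def mean(values):
    return sum(values) / len(values)

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 8],
    [1, 1, 7, 5],
]
max_pooled = pool2d(feature_map)           # [[6, 2], [2, 9]]
avg_pooled = pool2d(feature_map, op=mean)  # [[3.5, 1.0], [1.0, 7.25]]
```

A 4x4 map becomes 2x2: a quarter of the values, with the strongest (or average) response of each region kept.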

A pooling layer usually follows a convolutional layer.

36
Q

Query by committee

A

Used in active learning. We train multiple models and then ask expert to label only those examples about which models disagree the most.

Imagine having a committee of learners (algorithms) all trained on the same labeled data set. When faced with a new, unlabeled data point, QBC doesn’t ask a single learner for its guess. Instead, it compares the predictions of all committee members. If they disagree on how to classify the data point (because it falls in a confusing area for them), then QBC assumes this point holds valuable information for improving everyone’s learning. This “disagreement” becomes the query, prompting the labeling of that specific data point. By focusing on points where learners are unsure, QBC efficiently selects the most informative data for labeling, ultimately leading to a better-trained committee (and all the individual learners within it).
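
One common way to measure disagreement is vote entropy: the more evenly the committee's votes are split, the higher the score. A small sketch, where the committee is three hypothetical threshold classifiers standing in for trained models:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """0 when all members agree; grows as the votes split more evenly."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in Counter(votes).values())

def select_query(committee, unlabeled):
    """Pick the unlabeled point the committee disagrees about the most."""
    return max(unlabeled, key=lambda x: vote_entropy([m(x) for m in committee]))

# Hypothetical committee: three classifiers with slightly different thresholds.
committee = [
    lambda x: int(x > 4.0),
    lambda x: int(x > 5.0),
    lambda x: int(x > 6.0),
]
unlabeled = [1.0, 4.5, 9.0]
query = select_query(committee, unlabeled)  # 4.5 is the only point with a split vote
```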

37
Q

Random Forest vs. AdaBoost vs. Gradient Boost

A

Random Forest: Builds multiple decision trees independently and in parallel. Final prediction is based on the majority vote (classification) or average (regression) across trees.

AdaBoost: Creates a sequence of weak learners (often decision stumps), where each subsequent learner focuses on correcting the mistakes of the previous one. Final prediction is a weighted combination of the weak learners.

Gradient Boost: Similar to AdaBoost, it trains a series of weak learners sequentially. Each learner aims to correct the residual errors of the previous ensemble. AdaBoost iteratively increases the weight of misclassified examples, forcing subsequent learners to focus on difficult cases, while Gradient Boost directly trains new learners to predict the residuals (errors) of the current ensemble, progressively refining the predictions. This makes Gradient Boost a more general algorithm that can optimize various loss functions, often leading to higher accuracy but also a greater susceptibility to overfitting compared to AdaBoost.

38
Q

Self-learning

A

A type of semi-supervised learning. We build a model using the labeled examples, then use that model to label the unlabeled examples. If the confidence score of a predicted label meets the threshold, we add that example to the training set. We then rebuild the model and repeat iteratively until the whole data set is labeled (or no remaining example meets the threshold). Unfortunately, these models are often not very accurate.
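
The loop can be sketched with a deliberately simple model: a 1D nearest-centroid classifier for two classes, with confidence taken as the relative gap between the two distances. Both the model and the confidence measure are assumptions made for this sketch, and the data is invented.

```python
def centroids(labeled):
    """Mean position of each class (0 and 1) in the labeled set."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict_with_confidence(x, cents):
    d0, d1 = abs(x - cents[0]), abs(x - cents[1])
    label = 0 if d0 < d1 else 1
    return label, abs(d0 - d1) / (d0 + d1)

def self_train(labeled, unlabeled, threshold=0.55):
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        cents = centroids(labeled)  # rebuild the model on the current labels
        remaining = []
        for x in unlabeled:
            label, confidence = predict_with_confidence(x, cents)
            if confidence >= threshold:
                labeled.append((x, label))  # confident enough: keep the pseudo-label
            else:
                remaining.append(x)
        if len(remaining) == len(unlabeled):
            break  # nothing new was added, so stop
        unlabeled = remaining
    return labeled, unlabeled

labeled, leftover = self_train([(1.0, 0), (9.0, 1)], [2.0, 8.0, 3.0, 7.0, 5.2])
# 2.0 and 8.0 are added in the first round, 3.0 and 7.0 in the second;
# 5.2 sits between the classes and never reaches the threshold.
```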

39
Q

Sparsely Connected Layer

A

A Sparsely Connected Layer (SCL) is a type of neural network layer where not every neuron is connected to every neuron in the previous layer. This contrasts with traditional fully-connected layers where all neurons have connections. Here’s why SCLs matter:

Reduced Complexity: Fewer connections mean smaller models, faster computation, and less memory requirement – ideal for resource-constrained scenarios like mobile devices.
Potential for Overfitting Reduction: Sparse connections can act as a form of regularization, potentially preventing models from overfitting to the training data.
Biological Inspiration: SCLs are loosely inspired by the brain, where neurons are not fully interconnected either.
Finding Optimal Sparsity: A key challenge with SCLs is finding the right level of sparsity and the best connection patterns. This may involve techniques like pruning less important connections or using algorithms designed to discover optimal sparse structures.

Overall, sparsely connected layers represent a promising area of research as they aim to improve the efficiency and robustness of neural networks.

40
Q

Difference between K-Means Clustering and K-Nearest Neighbours

A

K-Means clustering is an unsupervised learning algorithm used to partition data points into K clusters based on similarity or distance measures. It aims to minimize the within-cluster variance and assigns each data point to the nearest centroid. On the other hand, K-Nearest Neighbors is a supervised learning algorithm used for classification and regression tasks. It predicts the class or value of a data point by considering the majority class or average value of its K nearest neighbors in the feature space. While K-Means clustering is used for clustering and segmentation, KNN is used for classification and regression tasks.

41
Q

Embedding vectors

A

High-dimensional vector representations of words or entities learned from textual data using techniques such as Word2Vec, GloVe, or FastText. Each embedding vector captures semantic and syntactic information about the corresponding word or entity, encoding its meaning and context in a dense vector space. Embedding vectors enable the representation of words as continuous-valued vectors, facilitating natural language processing tasks such as word similarity calculation, document classification, and named entity recognition.

42
Q

Featurized representation (word embedding)

A

In natural language processing (NLP), featurized representations (word embeddings) convert words into vectors of numbers that capture their meanings, relationships, and context.
Semantic Similarity: Words with similar meanings have similar vectors, allowing models to understand relationships between them (e.g., “cat” and “dog” would be closer in representation than “cat” and “airplane”). Words appearing in similar contexts tend to have related meanings.
Co-occurrence Matrices: Track how frequently words appear together in a window of text. Words occurring together often get similar vector representations.
Neural Networks: Models like Word2Vec (skip-gram and CBOW) are trained to predict a word based on its surrounding context, or vice versa. The learned weights within these networks become the embeddings.

As I understand it, there are no predefined lists of features. Each word is cross-referenced with every other word, and we measure how often they appear next to each other or in the same window of a sentence. The number we get is the strength of a feature.

t-SNE algorithm

43
Q

gated RNN

A

Traditional Recurrent Neural Networks (RNNs) can struggle with long-term dependencies. This means they have trouble retaining and using information from many timesteps in the past. The cause is the vanishing/exploding gradient problem during training.

In a standard RNN, the hidden state at each timestep is fully recomputed from the previous hidden state and the new input. This makes it hard to control what information is retained or forgotten over multiple timesteps. This crude updating mechanism makes standard RNNs susceptible to vanishing/exploding gradients. Important past information might decay too quickly, while less relevant information could get amplified disproportionately.

Selective cell update (selective memory): Gates introduce a fine-grained control mechanism.
Forget/Reset Gate: Allows the network to explicitly erase irrelevant information from the cell state, preventing it from cluttering the memory.
Update Gate: Decides how much of the new input (combined with the past hidden state) should update the cell state, promoting the retention of relevant new information.

GRNNs introduce “gates” – mechanisms that control the flow of information within their units. These gates help them selectively remember or forget information, improving their ability to handle long-term dependencies.

Two common types of GRNNs exist, each with a slightly different set of gates:

Long Short-Term Memory (LSTM)
Forget Gate: Helps forget irrelevant information from the past by erasing it from the cell state.
Input Gate: Decides how much new information to write into the cell state.
Output Gate: Controls how much of the internal cell state to expose as output.

Gated Recurrent Unit (GRU)
Reset Gate: Decides how much of the previous hidden state to use when computing the new candidate state.
Update Gate: Combines the roles of the LSTM’s forget and input gates for slightly simpler computation.

Gated RNNs have been widely used in natural language processing tasks, speech recognition, time series analysis, and sequential data modeling, where capturing temporal dependencies is crucial for accurate predictions.
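
A single GRU step can be sketched with scalars instead of vectors and matrices, which keeps the gate logic visible. The parameter names are made up, and the update convention used here (z close to 1 means "take the new candidate") is one of two equivalent conventions found in the literature.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p holds hypothetical weights and biases."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend the old state and the candidate according to the update gate.
    return (1 - z) * h_prev + z * candidate

zeros = dict.fromkeys(["wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh"], 0.0)

keep = dict(zeros, bz=-20.0)      # update gate almost closed: remember the past
overwrite = dict(zeros, bz=20.0)  # update gate almost open: replace the state

h_kept = gru_step(1.0, 0.7, keep)          # stays very close to 0.7
h_replaced = gru_step(1.0, 0.7, overwrite) # moves to the candidate, tanh(0) = 0
```

With the gate nearly closed the old state passes through almost unchanged, which is how gated units carry information across many timesteps.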

44
Q

Gradient boost

A

Gradient boosting is a machine learning ensemble method used for regression and classification tasks. It builds a predictive model by sequentially training weak learners (e.g., decision trees) to correct the errors of the previous models. In each iteration, the algorithm fits a new model to the residual errors of the current ensemble and updates the ensemble by adding the new model scaled by a learning rate. Gradient boosting algorithms, such as XGBoost and LightGBM, are known for their high predictive accuracy and robustness.
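
A minimal sketch of the loop for 1D regression with squared error, using decision stumps as the weak learners. The data is invented and the function names are chosen for this sketch; libraries such as XGBoost are far more elaborate.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps, predictions = [], [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # new learner fits the current errors
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 5.0, 5.2, 4.9]
baseline_error = mean_squared_error(lambda x: sum(ys) / len(ys), xs, ys)
boosted_error = mean_squared_error(gradient_boost(xs, ys), xs, ys)
```

On the training data, `boosted_error` ends up below `baseline_error` because each round removes part of the remaining residual.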

45
Q

K-means cluster

A

An unsupervised machine learning algorithm used for partitioning data into k distinct clusters based on similarity or proximity of data points. The algorithm iteratively assigns data points to the nearest cluster centroid and updates the centroids to minimize the within-cluster sum of squared distances. K-means clustering is widely used for clustering analysis, data segmentation, and pattern recognition tasks.
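
The assign-then-update loop can be sketched as follows. The points are invented, and the starting centroids are fixed by hand so the run is deterministic (real implementations usually pick them randomly or with a smarter scheme).

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, n_iters=10):
    clusters = []
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean_point(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```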

46
Q

Logistic Regression

A

Logistic regression is not a regression, but a classification learning algorithm. The name comes from statistics and is due to the fact that the mathematical formulation of logistic regression is similar to that of linear regression.

At a time when the absence of computers required scientists to perform manual calculations, they were eager to find a linear classification model. They figured out that if we define the negative label as 0 and the positive label as 1, we would just need to find a simple continuous function whose codomain is (0,1). In such a case, if the value returned by the model for input x is closer to 0, then we assign a negative label to x; otherwise, the example is labeled as positive. One function that has such a property is the standard logistic function (also known as the sigmoid function):

$f(x) = \frac{1}{1 + e^{-x}}$

47
Q

Polynomial regression

A

Polynomial regression is a form of linear regression where the relationship between the independent variable x and the dependent variable y is modeled as an n-degree polynomial function.
The polynomial regression model is expressed as:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \epsilon$,

where n is the degree of the polynomial, β coefficients are the regression parameters, and ϵ represents the error term. Polynomial regression allows for capturing non-linear relationships between variables and is useful when the relationship is not well represented by a straight line.
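
It is still "linear" regression because the model is linear in the β coefficients: the powers of x are just extra input columns. A sketch that fits the coefficients by solving the least-squares normal equations in plain Python (the data is generated from 1 + 2x + 3x² for illustration, and `fit_polynomial` is a name chosen here):

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = b0 + b1*x + ... + bn*x^n via the normal equations."""
    n_params = degree + 1
    # Design matrix: each row is [1, x, x^2, ..., x^degree].
    design = [[x ** p for p in range(n_params)] for x in xs]
    # Augmented matrix for (X^T X) beta = X^T y.
    aug = []
    for i in range(n_params):
        row = [sum(r[i] * r[j] for r in design) for j in range(n_params)]
        row.append(sum(r[i] * y for r, y in zip(design, ys)))
        aug.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_params):
        pivot = max(range(col, n_params), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n_params):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n_params] / aug[i][i] for i in range(n_params)]

xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]  # exactly 1 + 2x + 3x^2
coefficients = fit_polynomial(xs, ys, degree=2)  # close to [1, 2, 3]
```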

48
Q

Time-Series

A

A time series is a sequence of data points collected or recorded at successive time intervals. Time series data is often used to analyze and forecast trends, patterns, and behaviors over time. A time series model learns patterns and relationships from past observations and uses them to forecast future values. Various machine learning techniques can be applied to time series data, including traditional statistical methods (e.g., ARIMA, Exponential Smoothing), classical machine learning algorithms (e.g., Support Vector Machines, Random Forests), and deep learning models (e.g., Recurrent Neural Networks, Long Short-Term Memory networks). These models leverage the temporal dependencies present in the data to make accurate predictions, which are evaluated based on metrics such as mean squared error, mean absolute error, or forecast accuracy.

49
Q

Weak learners

A

Simple, relatively low-performing models or algorithms that perform only slightly better than random chance on a given learning task. In ensemble learning, weak learners are often combined to form a strong learner that achieves better predictive accuracy than any individual weak learner. Examples of weak learners include decision stumps (simple decision trees with only one split), perceptrons, and shallow neural networks.

50
Q

Weights

A

Weights are usually initialized with random values at the start of training. Training is all about finding the right weights. Weights essentially represent the knowledge the model has learned from the training data. Weights are the adjustable parameters within a machine learning model that fundamentally determine how it learns and makes predictions. Weights are parameters that are assigned to the features (input variables) of a model during the training process. These weights determine the importance of each feature in making predictions. The model uses these weights to combine the features and produce an output. A higher weight means a stronger influence of one neuron on another.

The weighted sum is then passed through an activation function (e.g., sigmoid, ReLU). This function adds nonlinearity, which is crucial for a neural network to learn complex patterns. It introduces a threshold-like behavior where the neuron “fires” (outputs a significant value) only if the weighted sum is large enough.

The weights are continuously adjusted in response to the data patterns, with the goal of minimizing errors in the model’s predictions. The process of adjusting the weights to minimize the difference between predicted and actual outputs is typically done through optimization algorithms such as gradient descent

Think of a recipe where the ingredients are your input data. To make a tasty dish:

Weights are like the amounts of each ingredient.
The cooking process is like the neural network calculations.
A chef learns by tasting the results (the errors) and adjusting the ingredient quantities (the weights) to get the perfect flavor.

Weights act as knobs that the model “tweaks” during the learning process to find the optimal mapping between inputs and outputs.
Backpropagation and algorithms like gradient descent are the driving force behind adjusting weights to optimize a machine learning model.
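
The "tweak the knob to reduce the error" idea can be sketched with a single weight and plain gradient descent on mean squared error. The data follows y = 2x, so the right weight is 2; the learning rate and epoch count are arbitrary choices for this sketch.

```python
def train_weight(xs, ys, lr=0.05, epochs=200):
    w = 0.0  # arbitrary starting value
    for _ in range(epochs):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad  # step against the gradient to reduce the error
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
learned_w = train_weight(xs, ys)  # converges to about 2.0
```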

51
Q

Word vectors (context)

A

Pretrained word vectors are available on the internet; we can download them and use them for capturing context and solving analogies.

52
Q

XGBoost

A

Extreme Gradient Boosting is an optimized and scalable implementation of the gradient boosting algorithm, a popular ensemble learning method. XGBoost builds a strong ensemble of decision trees sequentially by minimizing a differentiable loss function using gradient descent optimization. It incorporates several regularization techniques to prevent overfitting and improve generalization performance. XGBoost is widely used for classification, regression, and ranking tasks and has won numerous machine learning competitions for its high predictive accuracy and efficiency.

53
Q

Kernel vs Pooling

A

Kernels (in CNNs):

Purpose: Kernels are small matrices used in convolutional layers to detect patterns and extract features. They slide over the input data, performing calculations that highlight specific characteristics (edges, textures, shapes, etc.).
Resizing: Kernels themselves don’t directly resize the input. The output feature maps might have the same or different dimensions from the input, depending on factors like stride (the step size of the kernel movement) and padding.
Changing Values: Kernels change the values in the output feature map by emphasizing patterns. A kernel designed to find edges might produce high values where edges are present, and low values elsewhere.
Pooling Layers:

Purpose: Pooling layers are designed to downsample feature maps, reducing their spatial size. This makes the network more computationally efficient and helps prevent overfitting.
Resizing: Pooling layers explicitly resize the input by summarizing regions into smaller representations.
Changing Values: While pooling might change specific values due to the downsampling calculation (e.g., max value, average), its primary focus is on reducing dimensionality, not modifying value patterns like kernels do.
In Summary:

Kernels detect and enhance features, changing values to make the patterns more pronounced.
Pooling layers reduce the size of feature maps, making things computationally easier while preserving the most important information.
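
To make the kernel side concrete, here is a sketch of a 2x2 vertical-edge kernel sliding over a tiny image with stride 1 and no padding. The image and kernel values are invented, and the operation is the cross-correlation form that deep learning frameworks typically call convolution.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a weighted sum at each position."""
    k = len(kernel)
    output = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        output.append(row)
    return output

# Dark left half, bright right half: a vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]
edges = convolve2d(image, vertical_edge_kernel)
# [[0, 2, 0], [0, 2, 0], [0, 2, 0]]: high values only where the edge is
```

The output is 3x3 rather than 4x4 only because there is no padding; the kernel's job is to change the values, not to shrink the map.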

54
Q

Transformers

A

A type of deep learning model architecture primarily used in natural language processing (NLP) tasks. They are based on self-attention mechanisms that allow the model to weigh the importance of different input tokens when generating output representations. Transformers have achieved state-of-the-art performance in various NLP tasks, including language translation, text generation, and sentiment analysis, and are known for their parallelizability and scalability.

55
Q

Submodel

A

A component or subset of a larger predictive model that focuses on modeling a specific aspect or subset of the data. In machine learning, submodels are often used within ensemble methods, hierarchical models, or modular architectures to divide the modeling task into smaller, more manageable parts. Submodels can be trained independently or jointly with other components of the model and combined to make predictions or perform inference on the entire dataset.

56
Q

SVM

A

A supervised learning algorithm used for classification and regression tasks. SVMs work by finding the optimal hyperplane that separates classes in the feature space while maximizing the margin between the classes. In classification, SVM aims to find the hyperplane that best separates data points into different classes, while in regression, it aims to find the hyperplane that best fits the data. SVMs are effective for high-dimensional data and can handle both linear and non-linear relationships using kernel methods (the kernel trick).

57
Q

t-SNE

A

t-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm used for visualizing high-dimensional data in a lower-dimensional space, typically 2D or 3D. It is particularly useful for exploratory data analysis, dimensionality reduction, and spotting clusters visually.

58
Q

Recurrent Neural Network

A

A type of artificial neural network designed to process sequential data by maintaining an internal state (memory) that captures information from previous time steps. RNNs are characterized by feedback connections that allow information to persist and flow through the network over time. They are well-suited for tasks such as time series prediction, natural language processing, and speech recognition, where context and temporal dependencies are important.

59
Q

Recommendation system

A

They are a type of information filtering system that suggests items (products, movies, articles, etc.) or content that a user is likely to find relevant or interesting. They analyze past user behavior, preferences, and item characteristics to predict what a user might enjoy.

Recommendation systems can be framed as different machine learning problem types:
Classification: Predicting whether a user will like or dislike a specific item (e.g., thumbs up or thumbs down).
Regression: Estimating a numerical rating a user might give to an item (e.g., on a 5-star scale).
Ranking: Generating an ordered list of items most likely to be relevant to the user.

ML Techniques: Decision Trees, Random Forests, Neural Networks, Nearest Neighbour, Matrix Factorization

60
Q

Random Forest

A

An ensemble learning method used for classification and regression tasks. It constructs multiple decision trees during training and combines their predictions through averaging (for regression) or voting (for classification) to improve predictive accuracy and robustness. Each decision tree in the Random Forest is trained on a bootstrap sample of the original dataset, and random subsets of features are considered at each split. Random Forests are known for their high accuracy, scalability, and resistance to overfitting.

61
Q

Decision Tree

A

A decision tree is an acyclic graph that can be used to make decisions. In each branching node of the graph, a specific feature j of the feature vector is examined. If the value of the feature is below a specific threshold, then the left branch is followed; otherwise, the right branch is followed. As the leaf node is reached, the decision is made about the class to which the example belongs.

A decision tree is also a supervised learning algorithm used for classification and regression tasks. It recursively partitions the input space into regions based on feature values, creating a tree-like structure where each internal node represents a decision based on a feature, and each leaf node represents a class label or regression value. Decision trees are intuitive, interpretable, and can capture complex decision boundaries in the data.
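
Prediction with an already-built tree is just a walk from the root to a leaf. A sketch with a hand-written tree (the features, thresholds, and labels are invented; learning the splits from data is not shown):

```python
def predict(node, features):
    """Follow left when the feature is below the threshold, right otherwise."""
    while "label" not in node:
        if features[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

# Hypothetical tree over features [temperature_celsius, rain_probability].
tree = {
    "feature": 0, "threshold": 15.0,
    "left": {"label": "coat"},
    "right": {
        "feature": 1, "threshold": 0.5,
        "left": {"label": "no coat"},
        "right": {"label": "coat"},
    },
}

cold_day = predict(tree, [10.0, 0.1])   # "coat"
warm_dry = predict(tree, [20.0, 0.2])   # "no coat"
warm_rainy = predict(tree, [20.0, 0.9]) # "coat"
```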

62
Q

Convolutional Network

A

A type of artificial neural network designed for processing structured grid data, such as images or spatial data. Convolutional networks leverage convolutional layers, pooling layers, and fully connected layers to learn hierarchical representations of input data. They are widely used in computer vision tasks, including image classification, object detection, and image segmentation.

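A convolutional network is a neural network that significantly reduces the number of parameters without sacrificing too much quality. It does so by reusing the same small filter across the whole input, so it recognises regions that carry the same information wherever they appear.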

63
Q

Conditioning signal (GAN game)

A

Conditioning signal refers to additional information provided to the generator or discriminator to guide the generation process. Conditioning signals can include class labels, attributes, or other latent variables that influence the generation of realistic samples. Conditioning signals enable GANs to generate diverse and controllable outputs tailored to specific attributes or characteristics desired by the user.

64
Q

Bag of words

A

Bag of Words (BoW) is a simple and commonly used technique in natural language processing (NLP) for text representation. It involves treating a document as a collection of words or tokens and representing it as a sparse vector where each dimension corresponds to a unique word in the vocabulary, and the value represents the frequency of the word in the document. Bag of Words disregards word order and semantic information (for example, “dog bites man” and “man bites dog” get the same vector) but is effective for tasks such as text classification, document clustering, and information retrieval.
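
A small sketch of the representation, assuming simple lowercase whitespace tokenization (real pipelines usually tokenize more carefully). The three documents are made up.

```python
from collections import Counter

def build_vocabulary(documents):
    return sorted({token for doc in documents for token in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """One count per vocabulary word; word order is thrown away."""
    counts = Counter(document.lower().split())
    return [counts[token] for token in vocabulary]

docs = ["dog bites man", "man bites dog", "dog eats food"]
vocabulary = build_vocabulary(docs)  # ['bites', 'dog', 'eats', 'food', 'man']
vectors = [bag_of_words(d, vocabulary) for d in docs]
# [[1, 1, 0, 0, 1], [1, 1, 0, 0, 1], [0, 1, 1, 1, 0]]
```

The first two documents mean different things but get identical vectors, which is the loss of word order in action.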

65
Q

AdaBoost

A

A machine learning ensemble method used for classification and regression tasks. It combines multiple weak learners (e.g., decision trees) into a strong learner by sequentially training each learner on a modified version of the dataset. AdaBoost assigns higher weights to misclassified data points in each iteration, forcing subsequent learners to focus on difficult examples. The final prediction is a weighted combination of individual learner predictions, where more accurate learners have higher influence.

66
Q

Anomaly Detection

A

Anomaly detection algorithms aim to distinguish between normal data points and anomalous ones by learning the typical characteristics of the data distribution. This is typically achieved through unsupervised learning techniques, as anomalies often lack labeled examples. Common approaches to anomaly detection include statistical methods, clustering algorithms, and dedicated techniques such as isolation forests and one-class support vector machines (SVMs). Anomaly detection finds applications in various domains, including fraud detection, network security, system health monitoring, and industrial quality control, where identifying unusual patterns or behaviors is critical for maintaining integrity and reliability.
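
The simplest statistical method is a z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. A sketch with invented sensor readings; the threshold of 2.0 is an arbitrary choice, and with very small samples a single outlier cannot reach a much higher z-score.

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold std devs."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 25.0, 10.1]
anomalies = zscore_anomalies(readings)  # [25.0]
```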

67
Q

CycleGAN

A

CycleGAN extends the GAN framework with a two-generator architecture and a cycle consistency constraint to tackle image-to-image translation with unpaired data. A standard GAN learns a single generative model, and earlier image-to-image translation GANs (such as pix2pix) depend on paired examples for training. CycleGAN does not need perfectly paired training data (like photos and their corresponding paintings). It trains two generative models that work in opposite directions. One model translates images from a source domain (e.g., real photos) to a target domain (e.g., artistic style). The other model does the reverse, translating from target back to source. Here’s the twist: CycleGAN doesn’t just train each model in isolation. It introduces a cycle consistency check. If an image from the source domain is transformed to the target style and then back again, it should ideally return close to the original image. This cycle consistency, enforced by a loss function, ensures the transformations are meaningful and maintain the content of the image while applying the new style. By working together and constantly checking each other’s work, these generative models can learn to translate images between domains remarkably well, even without perfectly matched training data sets. Having two generators (one for each direction) and a discriminator for each domain stabilizes training considerably.

Paired vs. Unpaired Data:
Traditional GAN: When used for image-to-image translation (as in pix2pix), it relies on paired training data, where you have examples from both your source and target domains in direct correspondence (e.g., a photo and its matching Monet-style painting).
CycleGAN: Overcomes this limitation; it is designed to work with unpaired data. You simply need a set of images from each domain, and they don’t need to be direct translations of each other.

Focus of Learning:
Traditional GAN: Primarily focuses on learning a single generative model that can realistically create images from a given domain.
CycleGAN: CycleGAN involves two generators working in tandem. Each learns a mapping between the source and target domain, with the added cycle consistency objective enforcing this bidirectional mapping to be meaningful.

Training Mechanism:
Traditional GAN: The adversarial game is between a single generator and a discriminator. The discriminator tries to distinguish between real and generated images, while the generator aims to fool the discriminator.
CycleGAN: CycleGAN introduces additional loss terms beyond simple adversarial loss. The key addition is the cycle consistency loss, which ensures that translating an image from domain A to B, and then back to A, brings you close to the original image.
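
The cycle consistency loss can be sketched in a few lines. This is a toy illustration only: the "images" are short lists of pixel values and the generators are hypothetical stand-in functions (brighten and darken) rather than trained networks. The L1 (mean absolute) distance is used for the reconstruction error.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat 'images'."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_consistency_loss(image_a, image_b, g_ab, g_ba):
    """Reconstruction error for both cycles: A -> B -> A and B -> A -> B."""
    forward_cycle = l1_distance(g_ba(g_ab(image_a)), image_a)
    backward_cycle = l1_distance(g_ab(g_ba(image_b)), image_b)
    return forward_cycle + backward_cycle

# Hypothetical stand-in generators (real ones are neural networks).
def g_ab(img):
    return [p + 0.2 for p in img]      # "photo -> painting"

def g_ba_good(img):
    return [p - 0.2 for p in img]      # undoes g_ab almost exactly

def g_ba_bad(img):
    return [p - 0.5 for p in img]      # does not undo g_ab

photo = [0.1, 0.4, 0.7]
painting = [0.3, 0.6, 0.9]

good_loss = cycle_consistency_loss(photo, painting, g_ab, g_ba_good)
bad_loss = cycle_consistency_loss(photo, painting, g_ab, g_ba_bad)
print(round(good_loss, 6), round(bad_loss, 6))
```

In training, this term is added to the adversarial losses of the two generator/discriminator pairs, so a generator pair that cannot reconstruct its input is penalized even if each output looks realistic.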

68
Q

Analogy reasoning (NLP)

A

Analogy reasoning in natural language processing (NLP) involves understanding and solving analogical relationships between words or concepts (e.g., “man is to king as woman is to ?”). This means recognizing semantic similarities and relationships between pairs of words and extending these relationships to find appropriate analogies. Analogy reasoning tasks are commonly used to evaluate word embedding models on their ability to capture semantic relationships between words.
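
A common way to solve such analogies with word embeddings is vector arithmetic followed by a nearest-neighbor search. The sketch below uses tiny hand-made 3-dimensional vectors (an assumption for illustration; real embeddings are learned and have hundreds of dimensions).

```python
import math

# Hypothetical toy embeddings: (royalty, maleness, femaleness).
embeddings = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.7, 0.1],
    "apple":  [0.05, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(a, b, c):
    """Return the word d such that 'a is to b as c is to d'."""
    target = [vb - va + vc for va, vb, vc
              in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The three query words are excluded, as is usual for this task.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(solve_analogy("man", "king", "woman"))  # queen
```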

69
Q

ARIMA (Time Series Models)

A

ARIMA (AutoRegressive Integrated Moving Average) is a popular class of time series models used for forecasting and analyzing time-dependent data. It combines autoregressive (AR), differencing (I), and moving average (MA) components: differencing removes trend, while the AR and MA terms model the remaining autocorrelation. Seasonality is handled by the seasonal extension, SARIMA. ARIMA models are widely employed in finance, economics, and other fields for tasks such as stock price prediction, demand forecasting, and anomaly detection.
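
To make the AR and I components concrete, here is a minimal sketch of an ARIMA(1,1,0)-style forecast: difference the series once, fit a single autoregressive coefficient by least squares, and add the predicted change back to the last value. It omits the MA term and the intercept, and the function names and data are assumptions for illustration; in practice a statistics library would be used for fitting.

```python
def difference(series):
    """First-order differencing (the 'I' step): removes a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise."""
    numerator = sum(prev * cur for prev, cur in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator

def forecast_next(series):
    """One-step forecast: last value plus the predicted next difference."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    return series[-1] + phi * diffs[-1]

# Hypothetical series whose increases halve each step: 4, 2, 1, 0.5.
series = [10.0, 14.0, 16.0, 17.0, 17.5]
print(difference(series))     # [4.0, 2.0, 1.0, 0.5]
print(forecast_next(series))  # 17.75
```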

70
Q

CNN Hyperparameters

A

Architecture-Related
Number of Convolutional Layers
Filter Size
Stride
Number of Filters
Pooling type and Size

Training Specific
Learning Rate
Batch Size
Optimizer: Choice of algorithm (e.g., Adam, SGD with momentum, RMSprop) affects how the network updates its weights.
Regularization

Data-Related
Input Image Size
Data Augmentation: Techniques to increase data variability (rotations, flipping, noise, etc.) improve robustness and prevent overfitting.
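
Several of the architecture hyperparameters interact through simple arithmetic. The sketch below shows how filter size and stride determine a layer's output size, and how filter size and filter count determine its parameter count. Padding is not listed on this card and is included as an assumed extra parameter; the helper names are hypothetical.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (one dimension)."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def conv_param_count(filter_size, in_channels, num_filters):
    """Trainable parameters: one weight per filter cell plus one bias per filter."""
    return num_filters * (filter_size * filter_size * in_channels + 1)

print(conv_output_size(32, 5))                        # 28: 5x5 filter, stride 1
print(conv_output_size(28, 2, stride=2))              # 14: 2x2 pooling, stride 2
print(conv_output_size(224, 7, stride=2, padding=3))  # 112
print(conv_param_count(3, 3, 64))                     # 1792: 64 3x3 filters on RGB
```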

71
Q

Dense (NN)

A

In neural networks, Dense is a type of layer also known as a fully connected layer. In a dense layer, each neuron or node is connected to every neuron in the previous layer, forming a dense matrix of connections.

Dense layers can be used at various positions within a neural network architecture, depending on the specific task and architecture design. However, they are most commonly found towards the end of the network, especially in architectures designed for tasks such as classification or regression. Placing dense layers at the end of a neural network architecture allows for operations on the entirety of computations and features gathered earlier in the network.

Having dense layers in a neural network can indeed increase the risk of overfitting, especially when dealing with complex datasets or architectures with a large number of parameters. Dense layers have the capacity to learn intricate patterns in the training data, including noise, which may not generalize well to unseen data. Dropout is a regularization technique commonly used to mitigate overfitting in neural networks, including those with dense layers.
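
A dense layer's computation is a weighted sum plus a bias for each neuron, followed by an optional activation. The sketch below is a plain-Python illustration with made-up weights; the function names are assumptions, and real frameworks implement this as a matrix multiplication.

```python
def relu(z):
    return max(0.0, z)

def dense_forward(inputs, weights, biases, activation=None):
    """Forward pass of one dense layer.

    weights has one row per neuron; each row holds one weight per input,
    so every neuron sees every input (hence 'fully connected')."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def dense_param_count(n_inputs, n_neurons):
    """Weights plus biases; grows with the product of the two layer sizes."""
    return n_inputs * n_neurons + n_neurons

x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 0.5],   # neuron 0
     [1.0, 1.0, -1.0]]   # neuron 1
b = [0.0, 0.5]

print(dense_forward(x, W, b, relu))  # [0.0, 0.5]
print(dense_param_count(3, 2))       # 8
```

The parameter count helps explain the overfitting risk: a dense layer connecting 4096 inputs to 4096 neurons already has about 16.8 million parameters.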

72
Q

Early stopping

A

Early stopping is a technique used during the training of machine learning models to prevent overfitting. Imagine training your model as baking a cake. You want it to bake long enough to be done, but not so long that it burns. Overfitting is like burning the cake: the model learns the training data too specifically, including noise, and performs worse on new data. Early stopping acts like a timer: it monitors the model’s performance on a separate validation set. If performance starts to worsen, training is stopped, preserving the best version of the model before it starts to overfit.

Early stopping is a regularization technique used to combat overfitting during the training of machine learning models. Here’s how it works:

Monitoring Validation Performance: In addition to the data used for training, a separate validation dataset is used. The model’s performance on this validation set is monitored during each training iteration (epoch).
Detecting Worsening Performance: If the model’s performance on the validation set starts to degrade (e.g., error starts increasing), it’s a signal that overfitting is likely beginning.
Halting Training: Early stopping terminates the training process when this degradation is detected, even if the model could potentially keep improving on the training data.

Key Idea: The goal is to preserve the model’s state at the point where it generalizes best to unseen data, avoiding the over-specialization that leads to overfitting.
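
The monitoring logic can be sketched as a small loop over per-epoch validation losses. The "patience" parameter (how many non-improving epochs to tolerate before halting) is a common refinement that this card does not mention, so treat it and the sample losses as assumptions.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training halts once the loss has failed to improve for `patience`
    consecutive epochs; the model from best_epoch is the one to keep."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # never triggered: ran to the end

# Hypothetical validation losses: improving, then degrading.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping(losses))  # (3, 5)
```

In a real training loop, the model weights would be checkpointed whenever the validation loss improves, and restored from that checkpoint when training stops.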

73
Q

Feed Forward Network (FFN)

A

A feedforward network is a type of artificial neural network in which connections between nodes do not form cycles (i.e., there are no feedback connections); the multilayer perceptron (MLP) is its classic fully connected form. Unlike in recurrent neural networks (RNNs), information flows in a single, forward direction from input nodes through hidden layers to output nodes, without any loops or recurrent connections. Feedforward networks are versatile and can be used for various machine learning tasks, including classification, regression, and function approximation. CNNs are also feedforward networks.

Input Layer: Input features are fed into the input layer neurons. Within each subsequent layer, neurons perform calculations: they take a weighted sum of inputs from the previous layer and apply an activation function (non-linearity), which lets the network learn non-linear patterns.
Hidden Layers: Input signals are propagated forward through one or more hidden layers, where each neuron applies a weighted sum of inputs and an activation function to produce an output.
Output Layer: The output of the last hidden layer is passed to the output layer, where the final predictions are computed.

FFNs learn through backpropagation: During training, the error at the output is used to calculate gradients. These gradients are propagated backwards through the layers to update the weights, making the network better at its task.
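
The forward pass and a backpropagation update can be sketched for a tiny network with one input, two tanh hidden neurons, and one linear output. All weights, the learning rate, and the single training example are made-up values for illustration; real networks use automatic differentiation rather than hand-written gradients.

```python
import math

def forward(x, w1, b1, w2, b2):
    """Input -> hidden layer (tanh) -> linear output."""
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden)) + b2
    return hidden, output

def loss(x, target, w1, b1, w2, b2):
    """Squared-error loss: 0.5 * (output - target)^2."""
    _, output = forward(x, w1, b1, w2, b2)
    return 0.5 * (output - target) ** 2

def train_step(x, target, w1, b1, w2, b2, lr=0.1):
    """One gradient-descent update using hand-derived backpropagation."""
    hidden, output = forward(x, w1, b1, w2, b2)
    d_output = output - target  # dLoss/dOutput
    # Chain rule back through the output weights and tanh (derivative 1 - h^2).
    d_hidden = [d_output * w * (1 - h * h) for w, h in zip(w2, hidden)]
    new_w2 = [w - lr * d_output * h for w, h in zip(w2, hidden)]
    new_b2 = b2 - lr * d_output
    new_w1 = [w - lr * d * x for w, d in zip(w1, d_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, d_hidden)]
    return new_w1, new_b1, new_w2, new_b2

x, target = 1.0, 1.0
params = ([0.5, -0.3], [0.0, 0.1], [0.4, 0.2], 0.0)  # w1, b1, w2, b2
initial_loss = loss(x, target, *params)
for _ in range(50):
    params = train_step(x, target, *params)
final_loss = loss(x, target, *params)
print(initial_loss > final_loss)  # True
```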

74
Q

Gradient Boost vs. XGBoost

A

Gradient Boosting and XGBoost are both machine learning techniques used for supervised learning tasks, particularly for regression and classification. XGBoost is like gradient boosting on steroids, optimized for performance and scalability. It’s become a go-to algorithm for many structured data problems.

Gradient Boosting: Gradient Boosting is an ensemble learning technique where weak learners, typically decision trees, are trained sequentially, and each subsequent model corrects the errors made by the previous ones. It optimizes a loss function by fitting each new learner to the negative gradient of the loss with respect to the current predictions (for squared error, these are simply the residuals), which amounts to gradient descent in function space.

XGBoost: XGBoost (eXtreme Gradient Boosting) is a specific implementation of gradient boosting that is optimized for speed and performance. It includes several enhancements over traditional gradient boosting, such as a regularization term to control overfitting, a more efficient algorithm for splitting nodes, and support for parallel and distributed computing. XGBoost is known for its high accuracy and efficiency and has become a popular choice for competitions on platforms like Kaggle.

Key Enhancements
- Regularization: XGBoost heavily uses regularization to prevent overfitting:
a) Penalties on model complexity (e.g., number of leaf nodes in trees)
b) Shrinkage (scales down the contribution of each tree)
- Efficient Tree Building: It introduces optimizations in the way it finds the best splits for trees:
a) Approximate greedy algorithm for split finding
b) Sparsity awareness (handling missing values effectively)
- Parallel & Hardware Optimized:
a) Parallelizes the tree construction process.
b) Designed for efficient use of computer hardware (CPU cache awareness)
- Second-Order Gradients: While regular gradient boosting uses first-order gradients, XGBoost utilizes second-order gradients to provide more information for its weight update process.

XGBoost Advantages: XGBoost is often significantly faster than traditional gradient boosting implementations. Thanks to its regularization and optimizations, it often matches or beats other gradient boosting implementations on accuracy, although results depend on the dataset and tuning. Its computational efficiency makes it well-suited for large datasets.
When Gradient Boosting Might Still Be Better: If you need highly interpretable models, simpler gradient boosting implementations can be easier to understand. It is also sensible to use them on smaller datasets, where the overhead of XGBoost’s optimizations might not be worth it.
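
The core loop of plain gradient boosting (not XGBoost) can be sketched with one-split decision stumps on 1-D data and squared-error loss, where the negative gradient is just the residual. The function names, data, and hyperparameter values are assumptions for illustration; the sketch assumes at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Weak learner: the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # initial prediction: the mean
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        # Shrinkage: each tree contributes only a fraction of its prediction.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 3.0, 5.2, 5.1]
mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted_mse = mse(gradient_boost(xs, ys), xs, ys)
print(boosted_mse < baseline_mse)  # True
```

XGBoost builds on this same loop, adding the regularization, second-order information, and systems optimizations listed above.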

75
Q

Hidden Markov

A

Hidden Markov Models (HMMs) are statistical models used to describe sequences of observable events generated by underlying hidden states. In an HMM, the observed events form a sequence, while the hidden states represent the underlying, unobservable process that generates the observations. HMMs are characterized by two main components: transition probabilities, which describe the probability of transitioning between hidden states, and emission probabilities, which describe the probability of observing a particular event given a hidden state. HMMs are widely used in various applications such as speech recognition, natural language processing, bioinformatics, and time series analysis, where sequential data is prevalent, and the underlying structure is not directly observable.

Hidden States: A sequence of underlying states the system moves through, but these states are not directly observable.
Observations: At each timestep, you get an observation that depends on the current hidden state, but only probabilistically: each state has its own distribution over possible observations.

Real-World Examples
Speech Recognition: The underlying hidden states are the phonemes or words being spoken, the observations are the noisy sound recordings.
DNA Analysis: Hidden states represent different regions of DNA (coding, non-coding), observations are the sequences of letters (A, T, C, G).
Stock Market Modeling: Hidden states represent market conditions (bull, bear), observations are the daily stock prices.

Key Components of an HMM
An HMM is defined by:
Hidden States: A set of possible states the system can be in.
Observations: A set of possible symbols that can be observed.
Transition Probabilities: The probability of moving from one hidden state to another.
Emission Probabilities: The probability of observing a particular symbol in a given hidden state.
Initial State Probabilities: The probability of the system starting in a particular state.

How HMMs are Used
Decoding: Given a sequence of observations, figuring out the most likely sequence of hidden states that generated it (e.g., figuring out the spoken words from sound recordings).
Prediction: Predicting the likelihood of future observations based on previous ones.
Learning: Adjusting the transition and emission probabilities to better fit observed data.
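
Decoding is usually done with the Viterbi algorithm, which is a standard choice that this card does not name. The sketch below applies it to the stock market example, with observations simplified to daily "up" or "down" moves. All probabilities are made-up values for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s]: probability of the most likely path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans_p[p][s])
            scores[s] = best[-1][prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

states = ["Bull", "Bear"]
start_p = {"Bull": 0.6, "Bear": 0.4}                 # initial state probabilities
trans_p = {"Bull": {"Bull": 0.8, "Bear": 0.2},       # transition probabilities
           "Bear": {"Bull": 0.3, "Bear": 0.7}}
emit_p = {"Bull": {"up": 0.7, "down": 0.3},          # emission probabilities
          "Bear": {"up": 0.2, "down": 0.8}}

print(viterbi(["up", "up", "down", "down"], states, start_p, trans_p, emit_p))
# ['Bull', 'Bull', 'Bear', 'Bear']
```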

76
Q

Hierarchical softmax

A

Standard softmax is often used in neural networks for classification, but when you have a huge number of potential output classes (like a massive vocabulary), it becomes computationally expensive. Calculating probabilities involves a big calculation over every single class.

Hierarchical softmax tackles this by introducing a clever structure:
Tree of Output Classes: Instead of a flat list of each class, it arranges them into a tree. Each leaf node on the tree corresponds to a single output class (e.g., a specific word).
Path to Prediction: The probability of any specific word is calculated as the product of probabilities along the path from the tree’s root to that word’s leaf node.

Why It’s Faster: Instead of one giant calculation, predicting a word now involves a series of smaller decisions as you traverse down the tree (think a sequence of left or right turns). The computation time becomes proportional to the depth of the tree, which is much less than the total number of classes.

Hierarchical softmax offers a significant speed improvement, especially for large vocabularies or classification problems with many possible outcomes. The shape of the tree also matters: when it is built as a Huffman tree from word frequencies, frequent words get short paths (and so are the cheapest to compute), while every infrequent word still has its own well-defined path.
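
The path-product idea can be sketched with a balanced binary tree over a four-word vocabulary. At each internal node, a sigmoid turns a score into the probability of going left, and going right gets the remainder. The node scores here are made-up constants; in a trained model each would come from the hidden vector and that node's learned parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores at the three internal nodes of the tree.
node_scores = {"root": 0.4, "left": -1.2, "right": 2.0}

# Path from the root to each word's leaf: (internal node, direction taken).
paths = {
    "the": [("root", "L"), ("left", "L")],
    "cat": [("root", "L"), ("left", "R")],
    "sat": [("root", "R"), ("right", "L")],
    "mat": [("root", "R"), ("right", "R")],
}

def word_probability(word):
    """Product of the binary decision probabilities along the word's path."""
    prob = 1.0
    for node, direction in paths[word]:
        p_left = sigmoid(node_scores[node])
        prob *= p_left if direction == "L" else 1.0 - p_left
    return prob

total = sum(word_probability(w) for w in paths)
print(round(total, 6))  # 1.0
```

Each word needs only two sigmoid evaluations here (the tree depth) instead of a normalization over all four words, and the leaf probabilities still sum to 1 by construction.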

77
Q

Markov chain

A

A Markov chain is a model that describes a sequence of events where the probability of each event depends only on the state of the previous event. It’s like having a system with “short-term memory.” Imagine a frog hopping between lily pads. Its next hop only depends on which lily pad it’s currently on, not where it was before. Markov chains are used to model processes that seem somewhat random but have some underlying patterns based on the current state, such as weather patterns, text generation, or stock market fluctuations.

Markov models are all about the current state. The future depends only on the present, not the full history that led to it. This makes them relatively simple, as you don’t need to track extensive past information. They are useful for modeling systems where the next state has a clear probabilistic dependence on the current state (weather patterns, board game moves).
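
A Markov chain is fully described by its transition probabilities. The sketch below uses a two-state weather chain with made-up probabilities and pushes a probability distribution forward one day at a time; note that each step uses only the current distribution, never the history.

```python
# Hypothetical transition probabilities: transition[today][tomorrow].
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(distribution):
    """Advance the state distribution by one time step."""
    nxt = {state: 0.0 for state in transition}
    for current, prob in distribution.items():
        for target, p in transition[current].items():
            nxt[target] += prob * p
    return nxt

dist = {"sunny": 1.0, "rainy": 0.0}  # today is certainly sunny
for _ in range(3):
    dist = step(dist)
print({state: round(p, 3) for state, p in dist.items()})
# {'sunny': 0.688, 'rainy': 0.312}
```

Repeating the step many more times converges to the chain's stationary distribution, which for these numbers is 2/3 sunny and 1/3 rainy.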

Bayesian approaches, by contrast, are all about belief updating. They start with a prior belief about something (a hypothesis, a parameter distribution) and continuously update this belief as new evidence (data) comes in. They incorporate existing knowledge or assumptions into the model. They excel when you can quantify your prior beliefs about a problem and want your model to learn incrementally over time (spam filtering, medical diagnosis).

Simplified Analogy
Markov: Weatherman who only looks at today’s conditions for tomorrow’s forecast.
Bayesian: Weatherman who starts with climate averages, then continuously refines their forecast as each day’s data arrives.
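
The Bayesian side of the comparison comes down to repeated application of Bayes' rule, with each posterior becoming the next prior. The sketch below uses the spam filtering example with made-up probabilities; the function name is an assumption.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after seeing one piece of evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

# Hypothetical numbers: 20% of mail is spam; a given suspicious word appears
# in 60% of spam messages and 5% of legitimate ones.
belief = 0.2
belief = bayes_update(belief, 0.6, 0.05)
first_belief = belief
print(round(first_belief, 4))   # 0.75

# A second, independent piece of evidence with the same likelihoods.
belief = bayes_update(belief, 0.6, 0.05)
print(round(belief, 4))         # 0.973
```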

78
Q

Types of Layers:

A

Core Layers:
Dense (Fully Connected) Layers: The workhorse of many neural networks. Every neuron in a dense layer is connected to every neuron in the previous layer. Used for learning complex relationships between input features and for outputting final predictions.
Convolutional Layers: Designed to extract local patterns from data, especially images. They apply small filters that slide over the input, detecting features like edges and textures. Crucial for computer vision tasks.

Recurrent Layers (LSTM, GRU): Specialized for handling sequential data like text or time series. They maintain an internal memory to “remember” information from previous elements in the sequence, making them excellent for language modeling and tasks with temporal dependencies.

Normalization Layers:
Batch Normalization: Helps stabilize and speed up training by normalizing the activations of a layer across a batch of data. Reduces sensitivity to initialization and allows for higher learning rates.
Layer Normalization: Similar to batch normalization, but normalizes across the features within a single example, helpful for specific tasks like natural language processing.
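
The difference between the two is only the axis along which statistics are computed. The sketch below treats a batch as a list of rows (one row per example) and omits the learned scale and shift parameters that real layers include; the function names and `eps` value are assumptions.

```python
import statistics

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the examples in the batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example (row) across its own features."""
    return [normalize(row) for row in batch]

batch = [[1.0, 10.0, 100.0],
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0]]

bn = batch_norm(batch)  # every column now has mean 0
ln = layer_norm(batch)  # every row now has mean 0
print([round(v, 3) for v in bn[0]])
print([round(v, 3) for v in ln[0]])
```

Because layer normalization never looks at other examples, it behaves the same for any batch size, which is one reason it is common in sequence models.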

Activation Layers:
ReLU (Rectified Linear Unit): Very popular due to its simplicity and its ability to mitigate the vanishing gradient problem (its gradient is 1 for all positive inputs). It simply outputs the input if it’s positive, otherwise outputs zero.
Sigmoid: Maps input values between 0 and 1, often used for output layers in binary classification problems (predicting probabilities).
Tanh: Similar to Sigmoid, but maps inputs between -1 and 1, sometimes helpful for certain tasks.
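
The three activation functions above are short enough to write out directly; this is a plain-Python sketch operating on single numbers.

```python
import math

def relu(x):
    """Pass positive inputs through unchanged; clamp the rest to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real input into the range (-1, 1)."""
    return math.tanh(x)

for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), round(sigmoid(value), 4), round(tanh(value), 4))
```

Tanh is a rescaled sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, which is why the two curves have the same shape but different output ranges.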

Pooling Layers:
Max Pooling: Downsamples feature maps by taking the maximum value within a sliding window, reducing dimensionality and making the network more robust to small data variations.
Average Pooling: Similar to max pooling, but takes the average within the window.
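
Both pooling types can be sketched with one function that slides a window over a 2-D feature map and applies a reducer to each window. The function names, the 2x2 window with stride 2, and the sample feature map are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Downsample a 2-D feature map by reducing each size x size window."""
    rows = range(0, len(feature_map) - size + 1, stride)
    cols = range(0, len(feature_map[0]) - size + 1, stride)
    return [
        [reducer([feature_map[r + i][c + j]
                  for i in range(size) for j in range(size)])
         for c in cols]
        for r in rows
    ]

feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]

print(pool2d(feature_map))                # [[6, 4], [8, 9]]
print(pool2d(feature_map, reducer=mean))  # [[3.75, 2.25], [4.5, 4.0]]
```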

Other Specialized Layers:
Dropout: A regularization technique that randomly drops neurons during training to prevent overfitting.

Attention Mechanisms: Used in transformer architectures to allow the model to focus on important parts of the input sequence, crucial for advanced natural language processing tasks.

Sparsely Connected Layers: