deck_15595778 Flashcards
(46 cards)
What is a perceptron?
What does it do?
- an artificial neuron that can be used for binary classification
- it receives input signals through weighted connections, sums them to compute its activation level, and “fires” by outputting a 1 if the total input exceeds a given threshold. Otherwise, it outputs a 0.
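The firing rule can be sketched in a few lines of Python (the function and variable names are illustrative):

```python
def perceptron_output(weights, inputs, threshold):
    # weighted sum of the inputs = the activation level
    total = sum(w * x for w, x in zip(weights, inputs))
    # "fire" (output 1) only if the total exceeds the threshold
    return 1 if total > threshold else 0

# two inputs, both weighted 0.5, threshold 0.7:
# (1, 1) gives a total of 1.0 > 0.7, so the perceptron fires
print(perceptron_output([0.5, 0.5], [1, 1], 0.7))
```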
What happens during training of a perceptron?
During training, a series of small adjustments is made to the connection weights and threshold using the perceptron’s learning rate.
Where do the inputs to the perceptron in a classification task come from?
the inputs are the feature values, and the output is the classification label
What happens during the training phase if the output of the perceptron is wrong?
the threshold and the weights are adjusted according to the learning rate
Output was 0, target was 1
what happens to the threshold and weights?
lower the threshold
raise the weights
ignoring the inputs that are 0 (their weights stay unchanged)
Output was 1, target was 0
what happens to the threshold and weights?
raise the threshold
lower the weights
ignoring the inputs that are 0 (their weights stay unchanged)
What type of classifier is a perceptron?
Linear classifier
They create a straight-line decision boundary in the feature space.
They will only succeed (converge) if the data is linearly separable.
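For example, logical AND is linearly separable, so a single perceptron with fixed weights can classify it; the values below are one such choice:

```python
def fires(weights, inputs, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

# Weights [1, 1] and threshold 1.5 define the line x1 + x2 = 1.5,
# which separates (1, 1) from the other three corners of the unit square.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, fires([1, 1], x, 1.5))
```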
What does it mean to train a perceptron
finding the coefficients for a linear equation
coefficients are the connection weights
What are the differences between the Perceptron learning algorithm and Stochastic Gradient Descent
- both can be used for classification
- SGD finds an optimal solution based on a loss function that aims to ‘center’ the decision boundary between the classes
- the Perceptron learning algorithm will find a solution if it exists, but not necessarily the best solution
what is the Perceptron Learning Algorithm
- initialize the weights (ws), threshold (t), and learning rate (lr)
- repeat until done:
      for each training example (xs, target):
          compute the perceptron output: sum(ws * xs) > t
          if output < target:
              for each (w, x) in (ws, xs): w = w + x * lr
              t = t - lr
          else if output > target:
              for each (w, x) in (ws, xs): w = w - x * lr
              t = t + lr
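A runnable Python sketch of the algorithm above (the function name, random-initialization range, and default hyperparameters are illustrative choices, not part of the original):

```python
import random

def train_perceptron(examples, n_inputs, lr=0.1, epochs=100, seed=0):
    """examples: list of (xs, target) pairs, with target 0 or 1."""
    rng = random.Random(seed)
    # initialize the weights and threshold to small random values
    ws = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    t = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for xs, target in examples:
            output = 1 if sum(w * x for w, x in zip(ws, xs)) > t else 0
            if output < target:                       # should have fired but didn't
                ws = [w + x * lr for w, x in zip(ws, xs)]
                t -= lr
            elif output > target:                     # fired but shouldn't have
                ws = [w - x * lr for w, x in zip(ws, xs)]
                t += lr
    return ws, t

# learning logical AND (linearly separable, so training converges)
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
ws, t = train_perceptron(data, n_inputs=2)
```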
How do you initialize the weights and threshold of a perceptron?
typically with small random values
why is starting with random weights useful in a multi-layer perceptron?
because depending on where you start, you might converge on a better or worse solution.
When do we stop repeating the perceptron Learning Algorithm
- after a set number of epochs, or
- when the perceptron reaches a high level of accuracy, or
- when the perceptron hasn’t improved its accuracy in a while, or
- some other method or combination of methods.
when training with the perceptron Learning Algorithm, why do we normalize the data?
normalize the data before training so that the weights are all adjusted at about the same rate.
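One common way to do this is z-score normalization with numpy (a sketch; other scalings, such as min-max, also work):

```python
import numpy as np

# Each column (feature) gets mean 0 and standard deviation 1, so no
# feature dominates the weight updates just because of its scale.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```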
what is Randomized Presentation in a perceptron Learning Algorithm
you can randomize the presentation order of each example during an epoch instead of presenting them in the same order.
This can stop the network from getting stuck in a suboptimal solution (especially with more complex multi-layer perceptrons)
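In Python this is just a shuffle at the top of each epoch (the example data and loop body are placeholders):

```python
import random

examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
for epoch in range(10):
    random.shuffle(examples)   # new presentation order each epoch
    for xs, target in examples:
        pass                   # ...perceptron update for (xs, target) goes here...
```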
what is Batch Learning in a perceptron Learning Algorithm
instead of updating after every example, you can compute the outputs for the entire batch of examples in the training set, and then update the weights only once per epoch based on the outputs that were wrong
- this lets you write very short code with numpy and might prevent the network from getting stuck in a suboptimal solution
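One way a single batch-update epoch might look with numpy (a sketch; the function name and the way the per-example errors are accumulated are assumptions):

```python
import numpy as np

def batch_epoch(X, targets, ws, t, lr=0.1):
    """One epoch of batch perceptron learning: compute all outputs first,
    then apply the accumulated corrections once."""
    outputs = (X @ ws > t).astype(int)
    errors = targets - outputs         # +1 where output too low, -1 where too high
    ws = ws + lr * (errors @ X)        # summed weight updates over wrong examples
    t = t - lr * errors.sum()          # threshold moves opposite to the weights
    return ws, t
```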
What is the bias of a perceptron
a parameter called the bias is the weight on a connection to an input that is always set to 1 for every example; it is updated along with the other weights.
Mathematically, using a bias is the same as using an adjustable threshold:
the bias is the negation of the adjustable threshold (b = -t)
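A quick check of the equivalence in Python (function names illustrative): with b = -t, comparing the weighted sum against t is the same as adding b and comparing against 0.

```python
def fires_with_threshold(ws, xs, t):
    return 1 if sum(w * x for w, x in zip(ws, xs)) > t else 0

def fires_with_bias(ws, xs, b):
    # the bias acts as a weight on a constant input of 1; compare against 0
    return 1 if sum(w * x for w, x in zip(ws, xs)) + b > 0 else 0

# with b = -t the two formulations agree on every input
```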
How can you overcome the limitation that a perceptron can only converge on a solution if the data is linearly separable?
use a multi-layer perceptron
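XOR is the classic example of data that is not linearly separable: no single straight line separates its classes, but a two-layer network can. The weights below are hand-picked for illustration, not learned:

```python
def step(ws, xs, t):
    return 1 if sum(w * x for w, x in zip(ws, xs)) > t else 0

def xor_mlp(x1, x2):
    # hidden layer: one unit detects OR, another detects AND
    h_or  = step([1, 1], [x1, x2], 0.5)
    h_and = step([1, 1], [x1, x2], 1.5)
    # output: OR and not AND, i.e. exactly one input is on
    return step([1, -1], [h_or, h_and], 0.5)
```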
What is a multi-layer perceptron (MLP)?
An MLP is a network of artificial neurons arranged in layers. An individual neuron is similar to a perceptron.
What does each neuron do in a MLP?
accumulates input from the weighted connections and produces an output using an activation function
Why are MLPs sometimes referred to as feedforward networks
because activation always goes in one direction from input layer to output layer
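A feedforward pass might be sketched like this in numpy (the layer sizes, random weights, and the choice of a sigmoid activation are all illustrative assumptions):

```python
import numpy as np

def forward(x, layers):
    """Activation flows one way only: input layer -> hidden -> output.
    layers is a list of (W, b) weight/bias pairs."""
    a = x
    for W, b in layers:
        a = 1.0 / (1.0 + np.exp(-(a @ W + b)))   # weighted sum, then activation
    return a

# 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 3)), np.zeros(3)),
          (rng.normal(size=(3, 1)), np.zeros(1))]
out = forward(np.array([1.0, 0.5]), layers)
```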
What are MLPs with a larger number of hidden layers called?
Deep networks
Training a deep network is referred to as deep learning
What makes a network fully connected (or dense)?
each unit is connected to every neuron in the previous and next layers
Can multi layer perceptron classifiers learn a decision boundary of any shape?
Only if you have the right configuration of hidden layers, the right activation function, luck in choosing the random starting weights, and enough time / computational power to complete the training
if the configuration isn’t right, it might never converge
What is the limitation of the output neurons of an MLP?
they’re limited to linear combinations of the output from the previous layer.