ML Code Flashcards

Question

What is the Net class?

Answer 1

The neural network class

Answer 2

The base class nn.Module. It defines all the methods and attributes we need and handles them under the hood.

Answer 3

Define the actual neural network. self.classifier = nn.Sequential ( ... ) The input layer (data) is connected to a hidden layer (array of numbers), connected by weights.

Answer 4

Pass it a sequence of layers and it builds up a graph from the layers that you pass it. It passes the data through one layer at a time. This is a mathematical, matrix multiplication architecture.

Answer 5

nn.Linear()

Answer 6

How likely the image is to belong to the particular class. The higher number = more likely to be a member of that class. Larger = more confident in its answer.

Answer 7

When the weights have been trained - it learns to do the classification.

Answer 8

All the weights and biases are random.

Answer 9

If we don't, it will be lost forever. The self. means we are storing the classifier inside the object whenever it is created, so we will always have access to it.

Answer 10

A forward method. This is what happens when you give the model some data. This is how it knows what to do.

Answer 11

Self and the data we will pass into it. It does some commands on that data to pass it through the model.

Answer 12

[See flashcard] If we pass images, we need to reshape them so that instead of 28x28 pixels, we flatten it. Using the x.view command. It doesn't change the memory.

Answer 13

To reshape the data so that it is now flattened. Just have 1 dimension.

Answer 14

Pass it to the classifier we previously defined.

Answer 15

To turn these numbers into probabilities, we use the softmax function. [See flashcard] This operation turns the numbers into probabilities that all add up to 1. The log softmax variant returns the logarithm of those probabilities.
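A minimal sketch of the difference between softmax and log softmax. The logits below are made-up values for one image and three classes, not output from the flashcard model.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw model outputs (logits): one image, three classes.
logits = torch.tensor([[2.0, 1.0, 0.1]])

probs = F.softmax(logits, dim=1)          # probabilities along the class dimension
log_probs = F.log_softmax(logits, dim=1)  # logarithm of those probabilities

row_sum = probs.sum(dim=1)                # each row of probabilities sums to 1
```

Exponentiating log_probs recovers probs, which is why a model ending in log softmax pairs with nn.NLLLoss().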

Answer 16

Create an instance of the model model = Net() This is now an object in memory, the model we will train.

Answer 17

We take a batch of training data using the enumerate command. Pass the data into the model.

Answer 18

The index, the data and the label.

Answer 19

Find the predictions for a given image. Use torch.max()

Answer 20

Ask for the maximum of the output (given by the model) along dimension 1. This is the dimension with 10 numbers, ie what is the maximum of the 10 numbers. It gives two outputs: the maximum number and the position where it occurred. We are interested in the position, because it is the predicted class with the highest probability.
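As an illustration, here is torch.max applied along dimension 1 of a small made-up output with two images and three classes (the real model would give 10 numbers per image).

```python
import torch

# Hypothetical model output: 2 images, 3 class scores each.
output = torch.tensor([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1]])

# torch.max with a dimension returns (maximum values, positions of the maxima).
values, predicted = torch.max(output, 1)
```

Here predicted holds the class index for each image, which is the part used as the prediction.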

Answer 21

Confusion matrix This tells us how many of the predictions are correct.

Answer 22

disp = metrics.ConfusionMatrixDisplay.from_predictions(target, predicted)

Answer 23

All numbers on the diagonal are large, and all others are zero. This would be a perfect model.

Answer 24

Train the model

Answer 25

A loss function. Some way of modifying the function based on this score.

Answer 26

Cross entropy loss / negative log likelihood loss function nn.NLLLoss() Classification task - we have already converted the numbers to log probabilities.

Answer 27

The Adam optimiser. optim.Adam()

Answer 28

Tells us how good the output of the model is compared to what we want it to be. We need to know how to reward the model for good predictions.

Answer 29

It takes the outcome of the criterion to optimise the model. We need something to use that information and change the model in some way. It calculates all of the gradients in the back propagation step. Figures out how to change the weights of the models in order to make a better output.

Answer 30

An adaptive optimiser that is a good default algorithm for training a model.

Answer 31

The model parameters (we have an instance which has a method parameters, which gives all parameters eg weights). The learning rate.

Answer 32

If it is big, it will make big changes. if it is small it will make small changes. You want to balance this.

Answer 33

Iterate through the training data loader. In each iteration we have a set of 64 images and 64 target labels. We switch the model into training mode. We tell the optimiser to zero all the gradients (in case it has accumulated gradients already). This is part of the back propagation step. Then we pass our data through the model. We get 10 numbers for each image as the output. Put the 10 numbers into the criterion function, with the labels. It produces a score for how well the model did on that set of training data. We then do the backward propagation step on this score. It figures out how to change the weights to make the model better. Then we optimise using step, updating the weights to try and make it better. [See flashcard]
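The steps above can be sketched for a single batch as follows. The small model, the random stand-in data and the learning rate are assumptions for illustration, not the network from these flashcards.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical small classifier ending in log softmax, to pair with NLLLoss.
model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10), nn.LogSoftmax(dim=1)
)
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Random stand-in for one batch of 64 flattened images and 64 labels.
data = torch.rand(64, 784)
target = torch.randint(0, 10, (64,))

model.train()                     # switch to training mode
optimizer.zero_grad()             # clear any accumulated gradients
output = model(data)              # 10 numbers per image
loss = criterion(output, target)  # score for this batch
loss.backward()                   # back propagation: compute gradients
optimizer.step()                  # update the weights
```

In a full run this body would sit inside a loop over the training data loader.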

Answer 34

We want to test it. Iterate through the test data.

Answer 35

Turn the model to evaluation mode. Disable gradient calculations. Iterate over the data loader. Pass the data to the model. Get the maximum of the numbers for each image, taking the second output. Book keeping parts - keep track of how many are true and false in a particular batch.

Answer 36

How many of the predictions are equal to the targets in the test dataset. Gives 1 if the prediction is equal to target and 0 if not.

Answer 37

Obtain a fraction of how many of the predictions were correct.

Answer 38

normalize='true' in the brackets We can see what fraction of each category it gets correct. fig = plt.gcf() fig.set_figheight(8) fig.set_figwidth(10)

Answer 39

- Add in different types of layers
- Make the size of the hidden layers bigger or smaller
- Change the optimiser
- Change the learning rate
- Change the architecture
- Use different activation functions
- Train for multiple epochs

Answer 40

Linear activation function: weighted sum of all inputs and bias Can add a non-linear activation function to increase the complexity of the model eg ReLU()

Answer 41

One complete pass of the entire training dataset through a learning algorithm

Answer 42

nn.BatchNorm1d(500) nn.LeakyReLU(0.2, inplace=True) nn.Dropout(0.5)

Answer 43

Within for loop print(loss.detach())

Answer 44

Fundamental objects used to handle inputs, outputs and model parameters in PyTorch. A tensor is a structure that assumes multilinear relationships.

Answer 45

NumPy arrays. Tensors have a few extra features which help in machine learning applications. - They can be faster for calculations (especially when using GPUs) - They are optimised for automatic differentiation (required for back propagation)

Answer 46

Initialise a tensor from a list - like arrays torch.tensor(list)

Answer 47

torch.tensor(array)

Answer 48

np.asarray(tensor)

Answer 49

eg
shape = (3,2)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)
constant_tensor = torch.full(shape, 4.0)
identity_tensor = torch.eye(shape[0])
The identity tensor has 1 on the diagonal and 0 everywhere else.

Answer 50

_like x_ones = torch.ones_like(tensor) Can add in dtype = torch.float to override the datatype

Answer 51

.shape .dtype .device tells you what device the tensor is stored on eg the cpu

Answer 52

torch.float32

Answer 53

The .reshape(1,2,2) operation changes the shape while keeping the number of elements the same. The new shape is (1,2,2), meaning: 1: A new batch-like or singleton dimension. 2: The number of rows (same as before). 2: The number of columns (same as before). The same data but wrapped in an extra dimension.

Answer 54

.view() is used to reshape a tensor without changing its data. -1 is a special value that tells PyTorch to automatically infer that dimension based on the number of elements. 4 means that each new row should have exactly 4 columns. Tensor one is not actually modified.
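A short sketch of .view(-1, 4), assuming a made-up tensor_one with 8 elements.

```python
import torch

tensor_one = torch.arange(8)       # shape (8,)

# -1 lets PyTorch infer the number of rows: 8 elements / 4 columns = 2 rows.
reshaped = tensor_one.view(-1, 4)  # shape (2, 4)
```

tensor_one keeps its original shape; reshaped is a new view of the same data.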

Answer 55

Reshapes tensor_1 to have the same shape as tensor_2.

Answer 56

.eq() (short for equal) performs an element-wise comparison between tensor_2 and tensor_1.data.view_as(tensor_2).

Answer 57

.unsqueeze(dim) adds a new dimension at the specified position (dim). eg .unsqueeze(0) adds a new dimension at the first position.

Answer 58

.squeeze() removes dimensions with size 1 from a tensor.

Answer 59

Standard numpy-like indexing and slicing. First row - tensor[0] First column - tensor[:, 0] Last column - tensor[:, -1]

Answer 60

Use torch.cat() to concatenate a sequence of tensors along a given dimension eg t1 = torch.cat([tensor, tensor, tensor], dim=0)

Answer 61

tensor @ tensor.T OR tensor.matmul(tensor.T) OR y3 = torch.rand_like(tensor) torch.matmul(tensor, tensor.T, out=y3)

Answer 62

tensor * tensor OR tensor.mul(tensor) OR z3 = torch.rand_like(tensor) torch.mul(tensor, tensor, out=z3)

Answer 63

If you have a one-element tensor, for example by aggregating all values of a tensor into one value, you can convert it to a Python numerical value using item() agg = tensor.sum() agg_item = agg.item()

Answer 64

values,indices = tensor_squared.min(dim=0)

Answer 65

values,indices = tensor_squared.min(dim=1)

Answer 66

Operations that store the result into the operand. They are denoted by a _ suffix. eg x.copy_(y), x.t_() will change x

Answer 67

Defining a tensor as equal to another means they share the same memory location, so changes to one affect the other.

Answer 68

tensor2 = tensor.clone()
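To illustrate the difference between assigning a tensor to a new name and cloning it, here is a small sketch (the variable names alias and copy are chosen for this example).

```python
import torch

tensor = torch.zeros(3)

alias = tensor         # same object, same memory
copy = tensor.clone()  # independent copy with its own memory

tensor[0] = 1.0        # modify the original in place
```

After the in-place change, alias sees the new value while copy still holds the old one.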

Answer 69

Back propagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

Answer 70

torch.autograd It supports automatic computation of gradient for any computational graph.

Answer 71

requires_grad=True

Answer 72

loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y) Where z is a single layer of NN coded directly with tensors, z = torch.matmul(x, w)+b In this network, w and b are parameters, which we need to optimise.

Answer 73

We need to be able to compute the gradients of loss functions with respect to those variables. In order to do that, we set the requires_grad property of those tensors

Answer 74

grad_fn You can find the gradient function for z - z.grad_fn and the loss function loss.grad_fn

Answer 75

We need to compute the derivatives of the loss function with respect to the parameters, namely we need dloss / dw and dloss / db under some fixed values of x and y. To compute those derivatives, we call loss.backward() and then retrieve the values from w.grad and b.grad
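A minimal sketch of this, assuming a single layer with 5 inputs and 3 outputs and made-up values for x and y.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.ones(5)   # fixed input (assumed values)
y = torch.zeros(3)  # fixed expected output (assumed values)

# Parameters to optimise: mark them so autograd tracks them.
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = F.binary_cross_entropy_with_logits(z, y)

loss.backward()     # fills w.grad and b.grad with dloss/dw and dloss/db
```

The gradients have the same shapes as the parameters they belong to.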

Answer 76

The gradient value will grow. The gradients can be reset to start again b.grad = None w.grad = None

Answer 77

They are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that.

Answer 78

When we have trained the model and just want to apply it to some input data. We only want to do forward computations through the network.

Answer 79

Surrounding our computation code with torch.no_grad() with torch.no_grad(): # code Alternatively, we could use the detach() method on the tensor z_det = z.detach()

Answer 80

- To mark some parameters in the neural network as frozen parameters. This is a very common scenario for fine-tuning a pretrained network. - To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.

Answer 81

It accumulates the gradients. To compute the proper gradients, you need to zero out the grad property before. In real-life training, an optimiser helps us to do this.
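A small sketch of gradient accumulation, using a one-parameter example where the gradient of 3 * w with respect to w is always 3.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()   # gradient after one backward call

(3 * w).backward()
second = w.grad.item()  # the second call adds to the stored gradient

w.grad = None           # reset before computing a fresh gradient
(3 * w).backward()
third = w.grad.item()
```

In a training loop, optimizer.zero_grad() plays the role of the manual reset.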

Answer 82

Readability and modularity.

Answer 83

torch.utils.data.DataLoader torch.utils.data.Dataset

Answer 84

Stores the samples and their corresponding labels.

Answer 85

Wraps an iterable around the Dataset, to enable easy access to the samples.

Answer 86

We use matplotlib, indexing datasets manually like a list.

Answer 87

Using TensorDataset. tensor_dataset = TensorDataset(test_images,test_labels)

Answer 88

img, label = tensor_dataset[0] plt.imshow(img[0],cmap='Greys_r') plt.title(label.item()); Where image and label match the data of what was passed into TensorDataset

Answer 89

Retrieves the dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "mini batches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval.

Answer 90

DataLoader dataloader = DataLoader(tensor_dataset, batch_size=64, shuffle=True) Now we can iterate through the dataset as needed.

Answer 91

A batch of train_features and train_labels (containing 64 features and labels respectively, as determined at creation of DataLoader).

Answer 92

After we iterate over all batches, the data is shuffled.

Answer 93

train_features, train_labels = next(iter(dataloader))
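Putting TensorDataset and DataLoader together, a minimal sketch with random stand-in data (200 fake 28x28 images; the sizes are assumptions for illustration).

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-in for image data and labels.
test_images = torch.rand(200, 1, 28, 28)
test_labels = torch.randint(0, 10, (200,))

tensor_dataset = TensorDataset(test_images, test_labels)
dataloader = DataLoader(tensor_dataset, batch_size=64, shuffle=True)

# Pull out a single batch without writing a loop.
train_features, train_labels = next(iter(dataloader))
n_batches = len(dataloader)  # 200 samples in batches of 64 gives 4 batches
```

The last batch is smaller than 64 because 200 is not a multiple of the batch size.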

Answer 94

First convert it to tensors Convert pixel values in data frame to a numpy array, then pass this to torch.tensor.

Answer 95

Call .item() on the tensor.

Answer 96

TensorDataset()

Answer 97

The neural network library

Answer 98

Layers/modules that perform operations on data. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.

Answer 99

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Answer 100

model = NeuralNetwork()

Answer 101

print(model)

Answer 102

Pass it the input data. This executes the model's forward, along with some background operations.

Answer 103

eg 10 classes. The model returns a 10-dimensional tensor with raw predicted values for each class.

Answer 104

Passing it through an instance of the nn.Softmax module

Answer 105

For image data

Answer 106

Passing a sequence of layers and functions you want the nn to perform one after the other.

Answer 107

The number of inputs is the number of pixels eg 28*28

Answer 108

The number of classes possible in the prediction.

Answer 109

Increases the complexity and the ability of the model. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.

Answer 110

only when we do the training.

Answer 111

Create random data to pass through the model

Answer 112

Pass it through the nn.Softmax function logits = model(X) pred_probs = nn.Softmax(dim=1)(logits)

Answer 113

Using .argmax(1) y_pred = pred_probs.argmax(1) Predicted class - y_pred.item()

Answer 114

How well the model is performing.

Answer 115

Converts each 2D 28x28 image into a continuous array of 784 pixel values ( the minibatch dimension (at dim=0) is maintained).

Answer 116

Logits - these are raw values in the range (-infinity, infinity)

Answer 117

We pass them to the nn.Softmax module.

Answer 118

Scales the values between 0 and 1, representing the model's predicted probabilities for each class.

Answer 119

The dimension along which the values must sum to 1.

Answer 120

We need to provide log probabilities. This can be done more efficiently with the LogSoftmax module

Answer 121

A better approach is to take the model output logits and use CrossEntropyLoss() loss_fcn = nn.CrossEntropyLoss() loss_fcn(logits,labels)

Answer 122

There are associated weights and biases that are optimised during training.

Answer 123

It automatically tracks all fields defined inside the model object, and makes all parameters accessible using the model's parameters() or named_parameters() methods.

Answer 124

A common way is to serialise the internal state dictionary (containing the model). torch.save(model.state_dict(), "model.pth")

Answer 125

It includes re-creating the model structure and loading the state dictionary into it. model = NeuralNetwork() model.load_state_dict(torch.load("model.pth",weights_only=False))

Answer 126

The images were turned into 1D tensors which caused the loss of the relative position of pixels. We may expect neighbouring pixels to be related to each other more than ones that are far apart.

Answer 127

A set of kernels are convolved with the input to produce the output.

Answer 128

They are natural processes for revealing interesting features in images. This means that they tend to extract features that are likely to be important for a machine learning application.

Answer 129

The kernels themselves learn during the training process. ie the values of each element of the kernels are individual parameters to be trained.

Answer 130

- Padding - Stride These affect how the kernel is used and the shape of the output.

Answer 131

- Max pooling - Flatten

Answer 132

Max pooling effectively shrinks down an image by taking the maximum value over a specified range of pixels. Alternative - average pooling.

Answer 133

Flatten reduces the dimensionality, in a similar way to reshaping. This is required to connect a convolutional layer to a fully connected 1D layer, such as the output layer of the network.

Answer 134

class CNN(nn.Module):

Answer 135

1 if dealing with monochrome images, 3 if dealing with RGB images

Answer 136

No padding Stride = 1

Answer 137

nn.Conv2d - this is used for the first layer, we have to respect the dimensions of the data being input. This layer requires we have a channel - for monochrome images we have 1 channel. nn.MaxPool2d

Answer 138

loss_fn = nn.CrossEntropyLoss() loss_fn(outputs,labels)

Answer 139

It takes logits as the input and performs log_softmax and then NLLLoss()
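This equivalence can be checked numerically. The sketch below uses random logits and made-up labels for a batch of 4 samples and 10 classes.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)          # random stand-in for model outputs
labels = torch.tensor([1, 0, 9, 3])  # made-up target classes

# One step: cross entropy directly on the logits.
ce = nn.CrossEntropyLoss()(logits, labels)

# Two steps: log softmax first, then negative log likelihood.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
```

Both routes give the same loss value, so a model that returns raw logits should be paired with CrossEntropyLoss rather than NLLLoss.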

Answer 140

It is a function that directly computes the softmax. The object doesn't need to be instantiated and it can be used in line.

Answer 141

import torch.nn.functional as F

Answer 142

def train(model,loss_fn,optimizer): - Passing in the model, loss function and optimiser

Answer 143

Enumerate has a built in way to keep track of which batch we are on, we would have to do this manually for iterate.

Answer 144

- Compute the prediction error (get the predictions and pass to loss function) - Zero the gradients, carry out backwards propagation, use the optimiser to update the weights - We can print out how it is going eg plot the loss

Answer 145

In the test function, we do not carry out back propagation - it is set to evaluation mode - we turn off the gradient calculation using with torch.no_grad(): The function only takes in the model and loss function (no optimiser) def test(model, loss_fn):

Answer 146

n_param = 0 for parameter in model.parameters(): n_param+=np.prod(parameter.shape) n_param
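As a worked example of the count, here is the same loop applied to a small hypothetical fully connected model, alongside the equivalent sum over numel().

```python
import numpy as np
from torch import nn

# Hypothetical model: 784 -> 512 -> 10.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))

n_param = 0
for parameter in model.parameters():
    n_param += np.prod(parameter.shape)  # elements in this weight or bias tensor

# Same count using the tensors' own numel() method.
n_builtin = sum(p.numel() for p in model.parameters())
```

The total is (784*512 + 512) + (512*10 + 10) = 407050 weights and biases.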

Answer 147

Convolutional layers have fewer parameters than a fully connected neural network. This is due to weight sharing. We can get far better performance for images (or natural data) where there is correlation of particular elements of data.

Answer 148

When training complex models (large number of free parameters) with a limited data set.

Answer 149

The model will be able to fit the training data very well (potentially perfectly), but in the process learns a model that is very specific to the training data. When we are asked to make predictions on unseen data (eg testing dataset) the accuracy will be poor as the model is not generalisable. At the extreme end, the model may just be effectively "memorising" the training data, and may not have any predictive power at all.

Answer 150

Comparing the training data loss with the testing data loss. If the training data loss continues to improve with every training loop, but the testing loss does not improve (or gets worse) then overfitting may be responsible.

Answer 151

- Use a larger training set
- Use a smaller network
- Weight sharing (as in CNNs)
- Using dropout layers
- Data normalisation
- Data augmentation
- Early stopping
- Transfer learning
- Model averaging
- Weight decay
- Batch normalisation

Answer 152

Neural networks generally work best with values that follow a Normal distribution ie have an average of zero and a standard deviation of one.

Answer 153

Applying a transformation to the training data to improve its variability - artificially increasing the training dataset size

Answer 154

Stop the model when the test loss stagnates rather than continuing any further.

Answer 155

Using a pretrained model for part of your neural network and then just re-training the last few layers with your data.

Answer 156

Adding BatchNorm1d or BatchNorm2d layers to ensure inputs are close to a normal distribution

Answer 157

It may be impractical or expensive in practice.

Answer 158

It means that we need to restart training, rather than use what we already know about hyper parameters and appropriate weights.

Answer 159

We use pre-trained weights of a different model as part of the neural network. The architecture and weights of this model were trained using a larger dataset to solve a different image classification problem. Nevertheless, transfer learning allows us to leverage information from larger data sets with low computational cost. You would typically just train the last few layers with the training dataset while keeping the rest of the weights fixed.

Answer 160

It is no longer a pure test dataset, instead it is called a validation dataset.

Answer 161

CNN - doesn’t require data reshaping - Eg flatten layers - When presented with image data, just pass them into CNN - Real life – you may need to process more, but not for the images we will see

Answer 162

def train(model, train_loader, test_loader, batch_size=20, num_epochs=1, learn_rate=0.001, weight_decay=0):

Answer 163

Three compulsory arguments - Number of channels in: greyscale images = 1 - Number of channels out: this is how many kernels the layer will use - Kernel size: eg a 3x3 kernel means a kernel size of 3
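To see how these arguments affect the output shape, here is a sketch with a random batch of 8 greyscale 28x28 images. The choice of 16 output channels is an assumption for illustration.

```python
import torch
from torch import nn

images = torch.rand(8, 1, 28, 28)  # [batch, channels, height, width]

# 1 channel in, 16 kernels out, 3x3 kernel; defaults are no padding, stride 1.
conv = nn.Conv2d(1, 16, 3)
pool = nn.MaxPool2d(2)

conv_out = conv(images)    # each side shrinks by kernel_size - 1: 28 -> 26
pool_out = pool(conv_out)  # max pooling over 2x2 blocks halves each side: 26 -> 13
```

The channel dimension becomes 16 because each kernel produces its own output channel.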

Answer 164

This architectural part is to make sure that we end up with 10 outputs, one for each class.

Answer 165

The training block

Answer 166

In the train function, we can plot at the end of the training process. Need to keep track of epochs, losses, val_losses and val_acc throughout

Answer 167

def get_accuracy(model, test_loader, criterion): It must be in evaluation mode.

Answer 168

def get_accuracy(model, test_loader, criterion):
    correct = 0
    total = 0
    loss = 0
    model.eval() #*********#
    for imgs, labels in test_loader:
        output = model(imgs)
        loss += criterion(output, labels).item()
        pred = output.max(1, keepdim=True)[1] # get the index of the max logit
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += imgs.shape[0]
    return loss/total, correct / total

Answer 169

- Plotting loss over epoch number - Plotting accuracy over epoch number

Answer 170

Scaling the input features of a neural network, so that all features are scaled similarly (means and standard deviations). This makes the training problem easier. eg scaling so that there is a mean of 0 and standard deviation of 1 eg scaling so that things are in the range [0, 1]

Answer 171

train_mean = train_data.mean()
train_std = train_data.std()
norm = transforms.Normalize(train_mean, train_std)
train_data_norm = norm(train_data)
test_data_norm = norm(test_data)
This transform subtracts the mean value from each pixel, and divides the result by the standard deviation. We then pass this data to TensorDataset, create DataLoaders and pass these to an instantiation of the model.

Answer 172

While it is often expensive to gather more data, we can often programmatically "generate" more data points from our existing data set.

Answer 173

- Flipping each image horizontally or vertically (won't work for digit recognition but may for other tasks)
- Shifting each pixel a little to the left or right
- Rotating the images
- Scaling images up or down
- Adding noise to the image
- Can have a combination of these approaches

Answer 174

transform=transforms.RandomAffine(XXX) train_data_trans = transform(train_data) Then create another tensor dataset. train_dataset = TensorDataset(train_data_trans,train_labels) eg rotate by up to 25 degrees, translations of up to 5% of the image size, scaling from 80% to 110% of the original size transform=transforms.RandomAffine(25, translate=(0.05,0.05),scale=(0.8,1.1),)

Answer 175

Weight decay is a technique that prevents overfitting. It penalises large weights. We want to avoid large weights, because large weights mean that the prediction relies a lot on the content of one pixel, or on one unit. Intuitively, it does not make sense that classification should rely heavily on one, or a few pixels. Mathematically, we penalise large weights by adding an extra term to the loss function.

Answer 176

Weight decay can be done automatically inside an optimiser. The parameter weight_decay of optim.Adam and most other optimisers uses L^2 regularisation. The value of the weight_decay parameter is another tuneable hyperparameter. train(model, train_loader_aug, test_loader, num_epochs=50,weight_decay=1e-3)

Answer 177

Another way to prevent overfitting is to build many models, then average their predictions at test time. Each model might have a different set of initial weights. Dropout randomly zeros out a portion of neurons in each training iteration. This has the effect of preventing weights from being overly dependent on each other. Weights are encouraged to be "more independent" of one another. We only drop out neurons during training; at test time we use the entire set of weights. This means that the training and test behaviour of dropout layers is different.
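The different training and test behaviour can be seen with a standalone nn.Dropout layer. This is a sketch on a made-up input of 1000 ones, not part of the flashcard model.

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout(0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # roughly half the values are zeroed, the rest are scaled by 1 / (1 - 0.5)

drop.eval()
eval_out = drop(x)   # in evaluation mode the input passes through unchanged
```

The scaling during training keeps the expected size of the activations the same in both modes, which is why model.train() and model.eval() must be set correctly.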

Answer 178

We add a nn.Dropout2d(X) layer to our nn.Sequential eg nn.Conv2d(1, 16, 3), nn.MaxPool2d(2), nn.Dropout2d(0.5), nn.ReLU()

Answer 179

We add a nn.BatchNorm2d(X) layer to our nn.Sequential eg nn.Conv2d(1, 16, 3), nn.BatchNorm2d(16), nn.MaxPool2d(2), nn.Dropout2d(0.5), nn.ReLU()

Answer 180

To identify the sentiment of a particular bit of text. Eg deciding if an app review is positive or negative from the written words. The machine learning model has to learn something about language, and the meaning of particular sentences.

Answer 181

- Text is made of characters and strings, whereas neural networks deal with numbers and matrix operations - Texts can be different lengths, whereas the data before were all composed of equally sized 1D or 2D arrays of numbers

Answer 182

We need to convert the text into numbers. This is typically done in two stages. - The text is broken up into either individual characters or individual words and symbols - Each possible character or word is assigned a number (basically a lookup table or substitution code) In this way, a sequence of characters or words can be converted to numbers where each number represents a particular possibility.

Answer 183

tokenizer = get_tokenizer('basic_english')

Answer 184

Separates out the sentence into an array of words and punctuation. It also makes everything lowercase.

Answer 185

torch.manual_seed(99) np.random.seed(99)

Answer 186

_, tweet_sub_df = train_test_split(tweet_df, test_size=0.1) splits tweet_df into a 90% part and a 10% part. _ (underscore) is used to ignore the first returned value, so tweet_sub_df stores 10% of the dataset. train_df,test_df = train_test_split(tweet_sub_df,test_size=0.1) Double split - The first split selects a random subset (10%) of the full dataset. - The second split divides this subset into train (90%) and test (10%).

Answer 187

By passing all of the tokens from these tweets to build_vocab_from_iterator.
from torchtext.vocab import build_vocab_from_iterator
def yield_tokens(df):
    for n, row in df.iterrows():
        yield tokenizer(row[1])
vocab = build_vocab_from_iterator(yield_tokens(train_df), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

Answer 188

To put all unknown tokens as a value of zero. ie tokens not in the vocabulary will be given a value of zero by the vocab object

Answer 189

vocab(tokenizer('I ate my sandwich at my desk!')) returns the corresponding numbers of the tokens

Answer 190

vocab.get_stoi()['yes'] vocab.get_itos()[0]

Answer 191

The standard approach is then to build a vector for each token or piece of text where the length of the vector is equal to the number of possible tokens. For a given token, all elements of the vector are zero except at the index which corresponds to the position of that token in the vocabulary.
def make_vectors(text):
    indexes = vocab(tokenizer(text))
    vectors = torch.zeros(len(vocab),len(indexes))
    for n,ind in enumerate(indexes):
        vectors[ind,n]=1
    return vectors
text_vectors = make_vectors(text)
Can then investigate .shape and .argmax(0)

Answer 192

If we sum them, we get a single vector for each piece of text that counts how many times each token appears. This text data can be passed to a neural network model.
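A self-contained sketch of one-hot vectors and their sum. To avoid depending on torchtext, it assumes a tiny hand-written vocabulary dictionary and a pre-tokenised sentence; both are hypothetical stand-ins for the vocab and tokenizer objects.

```python
import torch

# Hypothetical vocabulary: index 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "i": 1, "ate": 2, "my": 3, "sandwich": 4}

def make_vectors(tokens):
    # Look up each token, falling back to index 0 for unknown tokens.
    indexes = [vocab.get(t, 0) for t in tokens]
    vectors = torch.zeros(len(vocab), len(indexes))
    for n, ind in enumerate(indexes):
        vectors[ind, n] = 1  # one column per token, a single 1 in each column
    return vectors

tokens = ["i", "ate", "my", "sandwich", "at", "my", "desk"]
text_vectors = make_vectors(tokens)

token_ids = text_vectors.argmax(0)  # recovers the index of each token
counts = text_vectors.sum(1)        # how many times each vocabulary entry appears
```

The summed vector has one entry per vocabulary item regardless of the text length, so it can be passed to a fixed-size input layer.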

Answer 193

It would need to be large - equal to the total number of possible tokens.

Answer 194

It would be a long vector - which is mostly zeros but has the count of each possible token in a given piece of text.

Answer 195

- Define a function text_2_vec
- Define a class CustomTextDataset
def text_2_vec(text):
    return make_vectors(text).sum(1)
class CustomTextDataset(Dataset):
    def __init__(self, labels, text):
        self.labels = labels
        self.text = text
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        label = self.labels[idx]
        text = self.text[idx]
        vec = text_2_vec(text)
        return label, vec

Answer 196

train_vectors = torch.Tensor(len(train_texts),len(vocab)) for n,t in enumerate(train_texts): train_vectors[n,:] = text_2_vec(t) This code creates an empty tensor for vectorised text. The text is converted into numerical vectors and stores them in train_vectors.

Answer 197

train_dataset = CustomTextDataset(train_labels,train_texts) test_dataset = CustomTextDataset(test_labels,test_texts) test_dataset[0][1].shape Using the previously defined CustomTextDataset class

Answer 198

Embedding There are better and quicker ways of doing the embedding eg nn.Embedding

Answer 199

Use a pre-trained embedding layer. GloVe - global vectors for word representation. GloVe embedding is an example of unsupervised learning, where the vector representation for words is learnt from a body of text by looking at the co-location of different words. The GloVe model learns which words are closely related and which are not. This enables the algorithm to place words in a multidimensional representation, so that similar words are close together and different words are far apart.

Answer 200

A layer pre trained on millions of documents and has learned an efficient embedding for English text.

Answer 201

glove = torch.load('glove6B_20000.pth') n_dim = glove.dim The dimension specifies how many dimensions are used to encode the tokens and max_vectors is how many tokens to include.

Answer 202

glove.get_vecs_by_tokens('yes')

Answer 203

It will return the zero-vector

Answer 204

'hello' in glove.stoi

Answer 205

Eg in some tokenisers, capitalisation is lost, which may be important

Answer 206

The number of inputs is n_dim which is glove.dim There are two outputs - we were looking at if tweets were positive or negative.

Answer 207

It is quicker to run, however the training loss, test loss and accuracy are a bit worse. This could be improved by using a larger GloVe model, eg we only use 20,000 words and 50 embedding dimensions. Additionally, the model using the pre-trained GloVe embedding has far fewer trainable weights than the previous model.

Answer 208

train_texts = train_df[1].values print(train_texts[3]) emb = nn.Embedding.from_pretrained(glove.vectors) inds = torch.tensor([vocab[t] for t in tokenizer(train_texts[3])]).reshape(1,-1,) print(inds) emb(inds).shape

Answer 209

It reduces the amount of information that was available to train the network. It would be better to pass each word in a sentence and to maintain the order of the words so that the model may learn to interpret the meaning of the texts. Recurrent neural networks (RNNs) are well suited to this task due to their ability to maintain a state which remembers the context of new words.

Answer 210

To reduce train time

Answer 211

There is no consistent size that we could choose for the tensors. The simplest approaches involve padding and truncating the tensors so that they are all the same length.

Answer 212

-1. The padding makes sure we can put the indices from many texts into a single tensor. -1 was chosen because it makes it easy to identify later which indices come from the padding (none of the real words are assigned -1).

Answer 213

Truncate down to the desired length
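Padding and truncation can be sketched together in plain Python. The helper name pad_or_truncate, the example indices and max_len=4 are assumptions made for the illustration; only the -1 padding value comes from the cards.

```python
PAD_INDEX = -1  # padding value used in the cards; no real word has this index

def pad_or_truncate(indices, max_len, pad_index=PAD_INDEX):
    """Return a list of exactly max_len indices."""
    clipped = indices[:max_len]                       # truncate long texts
    return clipped + [pad_index] * (max_len - len(clipped))  # pad short ones

# Three tokenised texts of different lengths (made-up indices).
batch = [[4, 7, 2], [5, 1, 9, 3, 8, 6], [0]]
fixed = [pad_or_truncate(seq, max_len=4) for seq in batch]

for row in fixed:
    print(row)
```

Because every row now has the same length, the batch can be turned into a single rectangular tensor.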

Answer 214

E.g. np.unique(train_df[0].values) shows that the labels are the numbers 1-4. We may want to subtract 1 from these so that they are 0-indexed (0-3). This allows them to work as class indices in the classification task.
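A minimal sketch of the shift to 0-indexed labels, using a made-up list in place of train_df[0].values:

```python
# Hypothetical labels in the range 1-4, standing in for train_df[0].values.
labels = [3, 1, 4, 2, 1]

# Subtract 1 so the classes run from 0 to 3.
zero_indexed = [label - 1 for label in labels]

print(sorted(set(zero_indexed)))
```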

Answer 215

The embedding layer does not accept the -1s, so we need to use .clip: data.clip(min=0)[0] This means anything below 0 gets clipped to 0.

Answer 216

embed_batch = nn.Embedding.from_pretrained(glove.vectors) embed_batch(data.clip(min=0)).shape
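To see the shapes involved, here is a small sketch in which random vectors stand in for glove.vectors (an assumption, so the numbers themselves are meaningless) and data is a made-up padded batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for glove.vectors: 10 tokens, 5 dimensions each (random, not real GloVe).
toy_vectors = torch.randn(10, 5)
embed_batch = nn.Embedding.from_pretrained(toy_vectors)

# A padded batch of token indices; -1 marks padding.
data = torch.tensor([[4, 7, 2, -1],
                     [0, -1, -1, -1]])

safe = data.clip(min=0)        # -1 becomes 0 so the lookup is valid
embedded = embed_batch(safe)   # shape: [batch_size, seq_len, n_dim]

print(embedded.shape)
```

Note that after clipping, padded positions look up the vector for index 0, which is why the position of the last real token has to be tracked separately.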

Answer 217

Hidden layers that modify and are modified by the update function as each element in the string is passed.

Answer 218

rnn_layer = nn.RNN(input_size=n_dim, hidden_size=50, batch_first=True)

Answer 219

[batch_size, seq_len, repr_dim] when batch_first is True.

Answer 220

It is 0s. h0 = torch.zeros(1,text_emb.shape[0],hidden_size)

Answer 221

out is the full history of the hidden state. last_hidden is the hidden state after all elements of the sequence have been passed through; it can be passed to the fully connected layers to do the classification task. out, last_hidden = rnn_layer(text_emb) We could also pass in h0.

Answer 222

The concatenation of all of the output units for each word (ie at each time point).

Answer 223

We only care about the output at the final time point. We can extract it like this: out[0,-1,:] For last_hidden the dimensions are in a different order ([num_layers, batch_size, hidden_size]), so index the batch in the second dimension instead: last_hidden[:,0,:]
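The relationship between the two return values can be checked with random input. This is a sketch with arbitrary sizes (batch of 3, sequence length 6, n_dim of 8); only the nn.RNN arguments mirror the cards.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, n_dim, hidden_size = 3, 6, 8, 50
text_emb = torch.randn(batch_size, seq_len, n_dim)   # random stand-in for embedded text

rnn_layer = nn.RNN(input_size=n_dim, hidden_size=hidden_size, batch_first=True)
h0 = torch.zeros(1, batch_size, hidden_size)          # initial hidden state of zeros

out, last_hidden = rnn_layer(text_emb, h0)

print(out.shape)          # [batch_size, seq_len, hidden_size]
print(last_hidden.shape)  # [num_layers, batch_size, hidden_size]

# For a single-layer, one-directional RNN the final time step of out
# matches last_hidden for the same batch element.
same = torch.allclose(out[0, -1, :], last_hidden[:, 0, :].squeeze(0))
print(same)
```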

Answer 224

The meaningful data, i.e. the tokens that went in with an index of 0 or above (not padding). (torch.arange(data.shape[1])*(data[0]>=0)).argmax() The argmax gives us the index of the final valid output, i.e. the bit we want.
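A worked example of that expression, extended to a whole batch with argmax(dim=1). The data tensor is made up, and the batched form is an assumption about how one might apply the same idea to every row at once.

```python
import torch

# Padded batch of token indices; -1 marks padding (made-up values).
data = torch.tensor([[4, 7, 2, -1, -1],
                     [5, 1, 9, 3, 8]])

positions = torch.arange(data.shape[1])   # 0, 1, 2, 3, 4
valid = data >= 0                         # True where a real token is present

# Padded positions are multiplied by False (0), so the largest remaining
# value in each row is the position of the last real token.
last_idx = (positions * valid).argmax(dim=1)

print(last_idx)
```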

Answer 225

- self.emb = nn.Embedding
- self.rnn = nn.RNN
- self.fc = nn.Sequential

class TextRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(glove.vectors)
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 50),
            nn.Dropout(0.2),
            nn.Linear(50, num_classes)
        )

Answer 226

- Find the index of the last valid (non-padding) input
- Apply the embedding
- Forward propagate through the RNN
- Get the last valid output
- Propagate through the fc layers to the output
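One way these forward steps might fit together is sketched below. This is an illustrative version, not necessarily the original course code: the class name TextRNNSketch is hypothetical, random vectors replace glove.vectors, and the input size is read from the vectors rather than passed in.

```python
import torch
import torch.nn as nn

class TextRNNSketch(nn.Module):
    """Illustrative model; random vectors stand in for glove.vectors."""

    def __init__(self, vectors, hidden_size, num_classes):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(vectors)
        self.rnn = nn.RNN(vectors.shape[1], hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 50),
            nn.Dropout(0.2),
            nn.Linear(50, num_classes),
        )

    def forward(self, data):
        # 1. Index of the last valid (non-padding) token in each row.
        positions = torch.arange(data.shape[1], device=data.device)
        last_idx = (positions * (data >= 0)).argmax(dim=1)
        # 2. Apply the embedding (padding clipped to 0 so the lookup is valid).
        text_emb = self.emb(data.clip(min=0))
        # 3. Forward propagate through the RNN.
        out, _ = self.rnn(text_emb)
        # 4. Pick the output at the last valid time step of each row.
        batch_idx = torch.arange(data.shape[0], device=data.device)
        last_out = out[batch_idx, last_idx]
        # 5. Fully connected layers produce one score per class.
        return self.fc(last_out)

torch.manual_seed(0)
toy_vectors = torch.randn(10, 8)   # 10 tokens, 8 dimensions (random stand-in)
model = TextRNNSketch(toy_vectors, hidden_size=16, num_classes=4)

data = torch.tensor([[4, 7, 2, -1, -1],
                     [5, 1, 9, 3, 8]])
logits = model(data)
print(logits.shape)
```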

Answer 227

- Long short-term memory (LSTM)
- Gated recurrent unit (GRU)
They both aim to overcome the vanishing gradients problem.

Answer 228

lstm_layer = nn.LSTM(input_size=n_dim, hidden_size=50, batch_first=True)

Answer 229

LSTM keeps track of both a hidden state and a cell state, so it has an extra state to initialise, and it returns both final states as a tuple.
h0 = torch.zeros(1, text_emb.shape[0], 50)
c0 = torch.zeros(1, text_emb.shape[0], 50)
out, (last_hidden, last_cell) = lstm_layer(text_emb, (h0, c0))
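A short sketch of what nn.LSTM returns, using random input with arbitrary sizes. Note that the second return value is a tuple of the final hidden state and the final cell state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, n_dim, hidden_size = 3, 6, 8, 50
text_emb = torch.randn(batch_size, seq_len, n_dim)   # random stand-in for embedded text

lstm_layer = nn.LSTM(input_size=n_dim, hidden_size=hidden_size, batch_first=True)

# Both states start as zeros and share the same shape.
h0 = torch.zeros(1, batch_size, hidden_size)
c0 = torch.zeros(1, batch_size, hidden_size)

out, (last_hidden, last_cell) = lstm_layer(text_emb, (h0, c0))

print(out.shape)          # [batch_size, seq_len, hidden_size]
print(last_hidden.shape)  # [num_layers, batch_size, hidden_size]
print(last_cell.shape)    # same shape as last_hidden
```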

Answer 230

Adapted forms of neural networks that are generally tasked with reproducing their input. For a perfect autoencoder, the input and output are the same.

Answer 231

There is a hidden layer with fewer dimensions than the input, creating an information bottleneck.

Answer 232

The encoder and the decoder, with the code section in between (latent representation).

Answer 233

You could use the trained encoder to reduce some data to a smaller representation, and send this somewhere else where the trained decoder is used to retrieve the original data. It is lossy because in practice the reconstruction is not perfect.

Answer 234

It is similar but we need to create an information bottleneck and train it using the same data as the input and output.

Answer 235

The loss function is required to measure the fidelity of the reconstruction, so we use the mean squared error between the input and output tensors.

Answer 236

A self.encoder and a self.decoder part. self.encoder is a network that maps the batch of images to the latent space; its final layer outputs latent_dim values, which is the size of the bottleneck. self.decoder is a network that maps the latent space back to the image reconstructions.

Answer 237

Now we don't have labels (as we did in classification) - we compare the reconstructions. For this, we use the MSE loss function. nn.MSELoss()
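Putting the encoder, decoder and MSE loss together, a minimal fully connected autoencoder might look like the sketch below. The layer sizes, latent_dim=16, the 28x28 image size, the Adam optimiser and the random "images" are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class AutoencoderSketch(nn.Module):
    """Illustrative autoencoder for 1x28x28 images (sizes are assumptions)."""

    def __init__(self, latent_dim=16):
        super().__init__()
        # Encoder: image -> latent space (the bottleneck).
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: latent space -> reconstructed image.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid(),
            nn.Unflatten(1, (1, 28, 28)),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

torch.manual_seed(0)
model = AutoencoderSketch(latent_dim=16)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.rand(4, 1, 28, 28)       # random stand-in for a batch of images
code = model.encoder(images)            # compressed representation
recon = model(images)

# No labels: the target of the loss is the input itself.
loss = criterion(recon, images)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(code.shape, recon.shape)
```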

Answer 238

Convolutional layers

Answer 239

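self.encoder has an nn.Flatten() layer. self.decoder has an nn.Unflatten() layer, which takes arguments (the dimension to unflatten and the target shape). Use a square number of outputs in the preceding linear layer so that it can be unflattened into a rectangular (here square) input for the layers that follow.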
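A small sketch of the Unflatten step on its own. The latent size of 8 and the 7x7 target shape are arbitrary choices; the point is that 49 is a square number, so the linear output reshapes cleanly into a one-channel 7x7 map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

latent_dim = 8
decoder_head = nn.Sequential(
    nn.Linear(latent_dim, 49),        # 49 = 7 * 7, a square number
    nn.Unflatten(1, (1, 7, 7)),       # dim 1 becomes [channels, height, width]
)

codes = torch.randn(4, latent_dim)    # random stand-in for a batch of latent codes
feature_maps = decoder_head(codes)

print(feature_maps.shape)
```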

Answer 240

An optional layer - kind of like an inverse of max pool which was used to reduce the data

ML Code Flashcards

(267 cards)