ML Code Flashcards

1
Q

What usually makes up a large fraction of the code needed to perform machine learning?

A

Data handling.

Before we can train a model, we need to get the data into a form which is compatible with the training and testing.

2
Q

Why do we train and test in batches?

A

Training models on the full dataset all at once would take too long.

3
Q

What sample is used at each iteration of training?

A

Different random sample.

4
Q

What is the main aim of building a model?

A

Make predictions on unseen data.

Classification or regression.

5
Q

What is generalisation?

A

The model’s ability to make predictions on new, unseen data.

6
Q

How do we read in a dataset?

A

df = pd.read_csv("file.csv")

7
Q

For image-based data, how do you create the training data?

A

train_data = torch.tensor(train_df.iloc[:, 1:].to_numpy(dtype='float32') / 255).reshape(-1, 1, 28, 28)

8
Q

For image-based data, how do you create the training labels?

A

train_labels = torch.tensor(train_df['label'].values)

9
Q

What is true of the training data and training labels shape?

A

The first dimension of the training data has the same number of elements as the training labels (one label per image).
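A minimal sketch of this shape check. The DataFrame here is a small synthetic stand-in for train_df (an assumption: a 'label' column followed by 784 pixel columns), not the real dataset:

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical stand-in for train_df: 5 images, a 'label' column plus 784 pixel columns.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(5, 784))
train_df = pd.DataFrame(pixels, columns=[f"pixel{i}" for i in range(784)])
train_df.insert(0, "label", [3, 1, 4, 1, 5])

# Scale pixels to [0, 1] and reshape to (N, channels, height, width).
train_data = torch.tensor(train_df.iloc[:, 1:].to_numpy(dtype="float32") / 255).reshape(-1, 1, 28, 28)
train_labels = torch.tensor(train_df["label"].values)

print(train_data.shape)    # torch.Size([5, 1, 28, 28])
print(train_labels.shape)  # torch.Size([5])
```

The first entry of both shapes is the number of images, which is what TensorDataset requires.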

10
Q

How do we create the dataset in the form of tensors?

A

train_dataset = TensorDataset(train_data, train_labels)

11
Q

What is the basic process for conducting data handling and training?

A

[See flashcard]

12
Q

How do you create a training data loader?

A

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

13
Q

How do you create the testing data loader?

A

test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False)
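A small sketch of how the two loaders behave, using random tensors as stand-ins for the real images and labels (the sizes 200 and 150 are arbitrary assumptions):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for the real tensors.
train_dataset = TensorDataset(torch.rand(200, 1, 28, 28), torch.randint(0, 10, (200,)))
test_labels = torch.randint(0, 10, (150,))
test_dataset = TensorDataset(torch.rand(150, 1, 28, 28), test_labels)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False)

# 200 samples in batches of 64 gives 4 batches (the last one holds 8 samples).
batch_sizes = [len(labels) for _, labels in train_loader]
print(batch_sizes)  # [64, 64, 64, 8]

# With shuffle=False the first test batch is simply the first 100 samples, in order.
_, first_test_labels = next(iter(test_loader))
print(torch.equal(first_test_labels, test_labels[:100]))  # True
```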

14
Q

How do we generate a subsample?

A

Enumerate.

examples = enumerate(train_loader)

We can use the data loaders as iterators, so we can go through the data and look at one batch at a time.

15
Q

What does enumerate do?

A

enumerate is a built-in function in Python that lets you keep track of the iteration number while looping: it yields (index, item) pairs.
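A tiny illustration with a plain list (the batch names are placeholders):

```python
batches = ["batch_a", "batch_b", "batch_c"]

seen = []
for idx, batch in enumerate(batches):
    # idx counts the iterations, starting from 0
    seen.append((idx, batch))

print(seen)  # [(0, 'batch_a'), (1, 'batch_b'), (2, 'batch_c')]
```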

16
Q

What can we do with the generated subsamples?

A

Look at one to investigate.

batch_idx, (example_data, example_targets) = next(examples)

Can look at shape and type of each. Keep asking for the next batch of data.
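A sketch of inspecting one batch, assuming a loader built from random stand-in data rather than the real images:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in dataset: 128 fake 28x28 images with labels 0-9.
dataset = TensorDataset(torch.rand(128, 1, 28, 28), torch.randint(0, 10, (128,)))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)

print(batch_idx)              # 0
print(example_data.shape)     # torch.Size([64, 1, 28, 28])
print(example_data.dtype)     # torch.float32
print(example_targets.shape)  # torch.Size([64])
```

Calling next(examples) again returns the following batch, with batch_idx equal to 1.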

17
Q

How do we plot images from the data?

A

Iterate through the data loader, to look at a batch of images and targets. Creates an array of plots to display the images and shows the target label.

Using ax.imshow() which is part of pyplot.

fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(10, 5))
axs = axs.flatten()
for ax, image, label in zip(axs, example_data, example_targets):
    ax.set_axis_off()
    ax.imshow(image[0], cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)

18
Q

What do we do after we have investigated the initial tensor data?

A

Flatten the images, so that each is a 1D tensor.

19
Q

What is torch?

A

The library for PyTorch

20
Q

What does data type = torch.float32 mean?

A

They are torch tensors and the data are stored as floating point numbers.

21
Q

Why is it important that the test and training sets come from the same source?

A

If there are systematic differences, it would be hard to teach a model to deal with that. The model needs to have seen similar examples in order to learn how to interpret the images.

22
Q

What is the torch.nn library?

A

The neural network library.

23
Q

For the basic, fully connected neural network, what should you do?

A

Flatten all of the images, to turn them into 1D data.

image_size = train_data.shape[1:]

To investigate the size:
input_layer_size = np.prod(image_size)

Each image (28x28 pixels) will then be represented by 784 numbers.
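A short check of this arithmetic, assuming a random stand-in for train_data with the (N, 1, 28, 28) layout used above:

```python
import numpy as np
import torch

train_data = torch.rand(5, 1, 28, 28)      # stand-in for the real training tensor

image_size = train_data.shape[1:]           # torch.Size([1, 28, 28])
input_layer_size = int(np.prod(image_size)) # 1 * 28 * 28

# Flatten every image into a row of input_layer_size numbers.
flat = train_data.view(-1, input_layer_size)

print(input_layer_size)  # 784
print(flat.shape)        # torch.Size([5, 784])
```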

24
Q

How do we define a model which is trained to convert image pixel values to labels?

A

[See flash card]

25
What is the Net class?
The neural network class
26
What do all neural network models inherit from?
The base class nn.Module This has all the methods and attributes defined we need. It will do all of this under the hood.
27
What do we need to do in the initialisation of the model?
Define the actual neural network. self.classifier = nn.Sequential ( ... ) The input layer (data) is connected to a hidden layer (array of numbers), connected by weights.
28
What do you do with nn.Sequential()?
Pass it a sequence of layers, it builds up a graph on the layers that you pass it. It passes the data through one layer at time. This is a mathematical, matrix multiplication architecture.
29
How do you define a fully connected linear layer?
nn.Linear()
30
What do the outputs represent in classification?
How likely the image is to belong to the particular class. The higher number = more likely to be a member of that class. Larger = more confident in its answer.
31
When does a model acquire meaning?
When the weights have been trained: that is when it learns to do the classification.
32
What can you say about the model when it is first initialised?
All the weights and biases are random.
33
Why do we need self when declaring the classifier?
If we don't, it will be lost forever. The self. means we are storing the classifier inside the object whenever it is created, so we will always have access to it.
34
What else needs to be defined in the model?
A forward method. This is what happens when you give the model some data. This is how it knows what to do.
35
What arguments does forward() take?
Self and the data we will pass into it. It does some commands on that data to pass it through the model.
36
For the simple neural network, what does the forward method do?
[See flashcard] If we pass images, we need to reshape them so that instead of 28x28 pixels, we flatten it. Using the x.view command. It doesn't change the memory.
37
Why do we need to use x.view()?
To reshape the data so that it is now flattened. Just have 1 dimension.
38
What do we do with the flattened data?
Pass it to the classifier we previously defined.
39
What is the final step of the forward method?
To turn these numbers into probabilities, we use the softmax function. [See flashcard] This operation gives the numbers a meaning: with the log-softmax variant the outputs are log probabilities, whose exponentials (the probabilities) all add up to 1.
40
What do we do once the model is defined?
Create an instance of the model model = Net() This is now an object in memory, the model we will train.
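The cards above point to a flashcard image for the actual model definition, which is not reproduced here. The following is only a sketch of what such a class could look like; the hidden layer size of 100, the ReLU activation and the use of log_softmax are assumptions, not the original code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, input_size=784, hidden_size=100, n_classes=10):
        super().__init__()
        # Storing the layers on self keeps them accessible (and trainable) later.
        self.classifier = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_classes),
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)       # flatten each image to 1D
        x = self.classifier(x)          # pass through the layers
        return F.log_softmax(x, dim=1)  # log probabilities per class

model = Net()
output = model(torch.rand(4, 1, 28, 28))  # 4 random stand-in images
print(output.shape)  # torch.Size([4, 10])
```

Exponentiating a row of the output gives probabilities that sum to 1, even though the untrained weights are random.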
41
Once the model is defined, what do we do?
We take a batch of training data using the enumerate command. Pass the data into the model.
42
What does the enumerate command output?
The index, the data and the label.
43
What do we do once we have passed data into the model?
Find the predictions for a given image. Use torch.max()
44
How do we utilise torch.max()?
Ask for the maximum of the output (given by the model) along dimension 1. This is the dimension with 10 numbers, ie what is the maximum of the 10 numbers. It gives two outputs: the maximum value and the position at which it occurred. We are interested in the position, because it is the predicted class with the highest probability.
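A small worked example with a made-up output for 2 images and 3 classes (the values are arbitrary):

```python
import torch

output = torch.tensor([[0.1, 2.5, 0.3],
                       [1.7, 0.2, 0.9]])

# Maximum along dimension 1 (across the classes of each image).
values, pred = torch.max(output, 1)

print(pred)  # tensor([1, 0])
```

Here values holds the largest number in each row and pred holds the index where it occurred, which is the predicted class.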
45
How can we visualise the data?
Confusion matrix This tells us how many of the predictions are correct.
46
What is the code for a confusion matrix?
disp = metrics.ConfusionMatrixDisplay.from_predictions(target, predicted)
47
What is the ideal confusion matrix?
All numbers on the diagonal are large, and all others are zero. This would be a perfect model.
48
What do we next need to do?
Train the model
49
What do we set up for training the model?
A loss function, and some way of modifying the model based on the resulting score (an optimiser).
50
Which loss do we do for classification?
Cross entropy loss / negative log likelihood loss function nn.NLLLoss() Classification task - we have already converted the numbers to log probabilities.
51
Which optimiser do we use?
The Adam optimiser. optim.Adam()
52
What does the loss function do?
Tells us how good the output of the model is compared to what we want it to be. We need to know how to reward the model for good predictions.
53
Explain how the optimiser is used.
It takes the outcome of the criterion to optimise the model. We need something to use that information and change the model in some way. It uses the gradients calculated in the back propagation step to figure out how to change the weights of the model in order to make a better output.
54
What is optim.Adam?
An adaptive optimiser that is a good algorithm for training a model.
55
What arguments does optim.Adam() take?
The model parameters (we have an instance which has a method parameters, which gives all parameters eg weights). The learning rate.
56
If the learning rate is big, what can be said of the changes to the model at each iteration?
If it is big, it will make big changes. If it is small, it will make small changes. You want to balance this.
57
How do we train the model?
Iterate through the training data loader. In each iteration we have a batch of 64 images and 64 target labels. We switch the model into training mode. We tell the optimiser to zero all the gradients (in case it has accumulated gradients already); this is part of the back propagation step. Then we pass our data through the model and get 10 numbers for each image as the output. We put the 10 numbers into the criterion function, together with the labels, which produces a score for how well the model did on that batch of training data. We then do the backward propagation step on this score, which computes the gradients that show how to change the weights to make the model better. Finally we call the optimiser's step, which updates the weights. [See flashcard]
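The steps above can be sketched as follows. The data are random stand-ins and the small model is a placeholder, so this only illustrates the order of the calls, not the flashcard's actual code:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 128 fake images, so 2 batches of 64.
dataset = TensorDataset(torch.rand(128, 1, 28, 28), torch.randint(0, 10, (128,)))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Placeholder model that outputs log probabilities, to match nn.NLLLoss.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 32), nn.ReLU(),
                      nn.Linear(32, 10), nn.LogSoftmax(dim=1))
criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

losses = []
model.train()                             # training mode
for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()                 # clear accumulated gradients
    output = model(data)                  # 10 numbers per image
    loss = criterion(output, target)      # score against the labels
    loss.backward()                       # back propagation
    optimizer.step()                      # update the weights
    losses.append(loss.item())

print(len(losses))  # 2
```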
58
What do you put at the start of a cell to determine how long it takes a cell to run?
%%time
59
What do we do once we have a trained model?
We want to test it. Iterate through the test data.
60
How do we carry out testing?
Turn the model to evaluation mode. Disable gradient calculations. Iterate over the data loader. Pass the data to the model. Get the maximum of the numbers for each image, taking the second output. Book keeping parts - keep track of how many are true and false in a particular batch.
61
What does pred.eq(target.data.view_as(pred)).sum() tell us?
How many of the predictions are equal to the targets in the test data. The comparison gives 1 if the prediction is equal to the target and 0 if not, and the sum counts the matches.
62
After running the testing, what do we do?
Obtain a fraction of how many of the predictions were correct.
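A sketch of the testing loop and the final fraction, again with random stand-in data and an untrained placeholder model (so the accuracy itself is meaningless here):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic test set and an untrained stand-in model (both assumptions).
test_dataset = TensorDataset(torch.rand(150, 1, 28, 28), torch.randint(0, 10, (150,)))
test_loader = DataLoader(test_dataset, batch_size=100, shuffle=False)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.LogSoftmax(dim=1))

model.eval()                      # evaluation mode
correct = 0
with torch.no_grad():             # no gradients needed for testing
    for data, target in test_loader:
        output = model(data)
        _, pred = torch.max(output, 1)   # second output = predicted class
        # Book keeping: count the predictions that match the targets.
        correct += pred.eq(target.data.view_as(pred)).sum().item()

accuracy = correct / len(test_loader.dataset)
print(f"{correct} of {len(test_loader.dataset)} correct")
```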
63
In the final confusion matrix, what can we add?
normalize='true' in the brackets We can see what fraction of each category it gets correct. fig = plt.gcf() fig.set_figheight(8) fig.set_figwidth(10)
64
What different things can you do to play around with the model?
- Add in different types of layers - Make the size of the hidden layers bigger or smaller - Change the optimiser - Change the learning rate - Change the architecture - Use different activation functions - Train for multiple epochs
65
What activation functions can you use? [Lecture 1]
Linear activation function: weighted sum of all inputs and bias Can add a non-linear activation function to increase the complexity of the model eg ReLU()
66
What is an epoch?
One complete pass of the entire training dataset through a learning algorithm
67
What other layers can you add to the sequential function?
nn.BatchNorm1d(500) nn.LeakyReLU(0.2, inplace=True) nn.Dropout(0.5)
68
How do you print the loss at each step?
Within for loop print(loss.detach())
69
What are tensors?
Fundamental objects used to handle inputs, outputs and model parameters in PyTorch. A tensor is a structure that assumes multilinear relationships.
70
What do tensors behave similarly to?
Numpy arrays They have a few extra features which help in machine learning applications. - They can be faster for calculations (especially when using GPUs) - They are optimised for automatic differentiation (required for back propagation)
71
How do you create a tensor from a list?
Initialise a tensor from a list - like arrays torch.tensor(list)
72
How do you turn a numpy array into a tensor?
torch.tensor(array)
73
How do you create an array from a tensor?
np.asarray(tensor)
74
How do you create tensors using torch commands?
eg
shape = (3, 2)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)
constant_tensor = torch.full(shape, 4.0)
identity_tensor = torch.eye(shape[0])  # 1 on the diagonal and 0 everywhere else
75
How do you create a tensor from another tensor?
_like x_ones = torch.ones_like(tensor) Can add in dtype = torch.float to override the datatype
76
What attributes are useful to ensuring tensors are compatible with each other and with the models in pytorch?
.shape .dtype .device tells you what device the tensor is stored on eg the cpu
77
What data type is typically used in models?
torch.float32
78
Describe the shape of this tensor torch.ones(2,2).reshape(1,2,2)
The .reshape(1,2,2) operation changes the shape while keeping the number of elements the same. The new shape is (1,2,2), meaning: 1: A new batch-like or singleton dimension. 2: The number of rows (same as before). 2: The number of columns (same as before). The same data but wrapped in an extra dimension.
79
What command do we use to reshape the data without changing its data?
.view
80
What does the code tensor_1.view(-1, 4) do?
.view() is used to reshape a tensor without changing its data. -1 is a special value that tells PyTorch to automatically infer that dimension based on the number of elements. 4 means that each new row should have exactly 4 columns. tensor_1 itself is not modified.
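For example, with a 12-element tensor (the values are arbitrary):

```python
import torch

tensor_1 = torch.arange(12)        # 12 elements: 0, 1, ..., 11
reshaped = tensor_1.view(-1, 4)    # -1 is inferred as 12 / 4 = 3 rows

print(reshaped.shape)  # torch.Size([3, 4])
print(tensor_1.shape)  # torch.Size([12])
```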
81
What does the code tensor_1.data.view_as(tensor_2) do?
Reshapes tensor_1 to have the same shape as tensor_2.
82
What does tensor_2.eq(tensor_1.data.view_as(tensor_2)) do?
.eq() (short for equal) performs an element-wise comparison between tensor_2 and tensor_1.data.view_as(tensor_2).
83
How do we add a new dimension in pytorch?
.unsqueeze(dim) adds a new dimension at the specified position (dim). eg .unsqueeze(0) adds a new dimension at the first position.
84
What does tensor.squeeze() do?
.squeeze() removes dimensions with size 1 from a tensor.
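A short illustration of unsqueeze and squeeze together:

```python
import torch

t = torch.ones(2, 2)
u = t.unsqueeze(0)                 # add a dimension at position 0
s = u.squeeze()                    # remove all size-1 dimensions
v = torch.ones(1, 3, 1).squeeze()  # both size-1 dimensions are removed

print(u.shape)  # torch.Size([1, 2, 2])
print(s.shape)  # torch.Size([2, 2])
print(v.shape)  # torch.Size([3])
```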
85
How do we access elements from a tensor?
Standard numpy-like indexing and slicing. First row - tensor[0] First column - tensor[:, 0] Last column - tensor[:, -1]
86
How can you join tensors?
Use torch.cat() to concatenate a sequence of tensors along a given dimension eg t1 = torch.cat([tensor, tensor, tensor], dim=0)
87
How do you carry out matrix multiplication of tensors?
tensor @ tensor.T OR tensor.matmul(tensor.T) OR y3 = torch.rand_like(tensor) torch.matmul(tensor, tensor.T, out=y3)
88
How do you find the element-wise product of tensors?
tensor * tensor OR tensor.mul(tensor) OR z3 = torch.rand_like(tensor) torch.mul(tensor, tensor, out=z3)
89
Describe single-element tensors.
If you have a one-element tensor, for example by aggregating all values of a tensor into one value, you can convert it to a Python numerical value using item() agg = tensor.sum() agg_item = agg.item()
90
How do you find the minimum of a column?
values,indices = tensor_squared.min(dim=0)
91
How do you find the minimum of a row?
values,indices = tensor_squared.min(dim=1)
92
What are in-place operations?
Operations that store the result into the operand. They are denoted by a _ suffix. eg x.copy_(y), x.t_() will change x
93
What happens in this case? tensor1 = tensor
Defining a tensor as equal to another means they share the same memory location, so changes to one affect the other.
94
How do you create a definitive copy of a tensor, rather than pointing to the same place in memory?
tensor2 = tensor.clone()
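A minimal demonstration of the difference between assignment and clone():

```python
import torch

tensor = torch.zeros(3)
tensor1 = tensor           # same object, same memory
tensor2 = tensor.clone()   # independent copy

tensor[0] = 5.0            # modify the original

print(tensor1[0].item())  # 5.0
print(tensor2[0].item())  # 0.0
```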
95
What is the most frequently used algorithm when training neural networks?
Back propagation. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.
96
How do we compute the gradients of the loss function with respect to the given parameter (used to adjust parameters) in PyTorch?
torch.autograd It supports automatic computation of gradient for any computational graph.
97
What parameter is used to say that we want to know gradients with respect to weights and biases?
requires_grad=True
98
In one line, how do we compare the output of the neural network with the expected outputs?
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y) Where z is a single layer of NN coded directly with tensors, z = torch.matmul(x, w)+b In this network, w and b are parameters, which we need to optimise.
99
How do we optimise the parameters, w and b of the network?
We need to be able to compute the gradients of loss functions with respect to those variables. In order to do that, we set the requires_grad property of those tensors
100
What property of a tensor stores a reference to the backward propagation function?
grad_fn You can find the gradient function for z - z.grad_fn and the loss function loss.grad_fn
101
How do we optimise the weights of parameters in the neural network?
We need to compute the derivatives of the loss function with respect to the parameters, namely ∂loss/∂w and ∂loss/∂b under some fixed values of x and y. To compute those derivatives, we call loss.backward() and then retrieve the values from w.grad and b.grad
102
What happens if we keep calculating the backpropagation? What do we therefore need to do?
The gradient value will grow. The gradients can be reset to start again b.grad = None w.grad = None
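A sketch of the accumulation and reset, using the single-layer network from the earlier cards. The sizes (5 inputs, 3 outputs) and the helper name compute_loss are assumptions for illustration:

```python
import torch

torch.manual_seed(0)
x = torch.ones(5)                          # input
y = torch.zeros(3)                         # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

def compute_loss():
    # Hypothetical helper: rebuilds the graph so backward can be called again.
    z = torch.matmul(x, w) + b
    return torch.nn.functional.binary_cross_entropy_with_logits(z, y)

compute_loss().backward()
first = b.grad.clone()

compute_loss().backward()                  # gradients accumulate
accumulated = b.grad.clone()
print(torch.allclose(accumulated, 2 * first))  # True

b.grad = None                              # reset before starting again
w.grad = None
compute_loss().backward()
print(torch.allclose(b.grad, first))       # True
```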
103
What happens by default to tensors with requires_grad=True?
They are tracking their computational history and support gradient computation. However, there are some cases when we do not need to do that.
104
When would we not need to track computational gradients with requires_grad=True?
When we have trained the model and just want to apply it to some input data. We only want to do forward computations through the network.
105
How can we stop tracking gradient computations?
Surrounding our computation code with torch.no_grad() with torch.no_grad(): # code Alternatively, we could use the detach() method on the tensor z_det = z.detach()
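Both options can be checked through the requires_grad attribute; the tensor sizes below are arbitrary:

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
print(z.requires_grad)          # True

with torch.no_grad():
    z_no_grad = torch.matmul(x, w) + b
print(z_no_grad.requires_grad)  # False

z_det = z.detach()
print(z_det.requires_grad)      # False
```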
106
Why might you want to disable gradient tracking?
- To mark some parameters in the neural network as frozen parameters. This is a very common scenario for fine-tuning a pretrained network. - To speed up computations when you are only doing forward pass, because computations on tensors that do not track gradients would be more efficient.
107
What does PyTorch do with the gradients when we perform backward propagation?
It accumulates the gradients. To compute the proper gradients, you need to zero out the grad property before. In real-life training, an optimiser helps us to do this.
108
Why is it ideal to have our dataset code decoupled from our model training code?
Readability and modularity.
109
What data primitives does PyTorch provide to allow you to use pre-loaded datasets as well as your own data?
torch.utils.data.DataLoader torch.utils.data.Dataset
110
What does Dataset do?
Stores the samples and their corresponding labels.
111
What does DataLoader do?
Wraps an iterable around the Dataset, to enable easy access to the samples.
112
How do we visualise some samples in our training data?
We use matplotlib, indexing datasets manually like a list.
113
If we load our data from pandas, we can extract the labels and pixels. How do we make this into a PyTorch dataset?
Using TensorDataset. tensor_dataset = TensorDataset(test_images,test_labels)
114
How do we plot an example of the tensor dataset?
img, label = tensor_dataset[0] plt.imshow(img[0],cmap='Greys_r') plt.title(label.item()); Where image and label match the data of what was passed into TensorDataset
115
How does the Dataset retrieve our dataset's features and labels?
Retrieves the dataset's features and labels one sample at a time. While training a model, we typically want to pass samples in "mini batches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval.
116
While training a model, we typically want to pass samples in "mini batches", reshuffle the data at every epoch to reduce model overfitting, and use Python's multiprocessing to speed up data retrieval. What can we use to abstract this complexity into an easy API?
DataLoader dataloader = DataLoader(tensor_dataset, batch_size=64, shuffle=True) Now we can iterate through the dataset as needed.
117
When iterating through a DataLoader, what does each iteration return?
A batch of train_features and train_labels (containing 64 features and labels respectively, as determined at creation of DataLoader).
118
What does shuffle=True do?
After we iterate over all batches, the data is shuffled.
119
How do we iterate to the next DataLoader iteration?
train_features, train_labels = next(iter(dataloader))
120
What is the first thing we need to do once we load data from a pandas data frame?
First convert it to tensors Convert pixel values in data frame to a numpy array, then pass this to torch.tensor.
121
How do we get the number from a tensor, rather than having the tensor itself?
Call .item() on the tensor, eg tensor.item()
122
How do we build a dataset out of the tensor data?
TensorDataset()
123
What does nn eg in nn.Module represent?
The neural network library
124
What are neural networks comprised of?
Layers/modules that perform operations on data. A neural network is a module itself that consists of other modules (layers). This nested structure allows for building and managing complex architectures easily.
125
How do we define a neural network with Linear and ReLU activations?
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
126
How do we create an instance of the defined model?
model = NeuralNetwork()
127
How do we print the structure of the defined model?
print(model)
128
How do we use the model?
Pass it the input data. This executes the model's forward, along with some background operations.
129
What dimension of output does calling the model on the input return?
eg with 10 classes, calling the model on the input returns 10 raw predicted values (one for each class) per input sample.
130
How can we get the prediction probabilities?
Passing it through an instance of the nn.Softmax module
131
When is the flatten function particularly useful?
For image data
132
What do we pass to nn.Sequential()?
Passing a sequence of layers and functions you want the nn to perform one after the other.
133
For image data, what is the input value to the sequence of layers in nn.Sequential()?
The number of inputs is the number of pixels eg 28*28
134
For image data, what is the output value of the sequence of layers in nn.Sequential()?
The number of classes possible in the prediction.
135
What does adding non-linear activation functions such as nn.ReLU() do?
Increases the complexity and the ability of the model. They are applied after linear transformations to introduce nonlinearity, helping neural networks learn a wide variety of phenomena.
136
When do models acquire meaning?
only when we do the training.
137
How do we train the model?
Create random data to pass through the model
138
How do we get the prediction probabilities from the output of the model?
Pass it through the nn.Softmax function logits = model(X) pred_probs = nn.Softmax(dim=1)(logits)
139
Of the (eg 10) probabilities output of nn.Softmax, how do we determine what the model predicted?
Using .argmax(1) y_pred = pred_probs.argmax(1) Predicted class - y_pred.item()
140
What does the loss function show?
How well the model is performing.
141
What does the nn.Flatten layer do?
Converts each 2D 28x28 image into a continuous array of 784 pixel values ( the minibatch dimension (at dim=0) is maintained).
142
What does the last linear layer of the neural network return?
Logits - these are raw values in the range -infinity to infinity
143
Once logits are returned from the last linear layer of the neural network, where do we put them?
We pass them to the nn.Softmax module.
144
What does the nn.Softmax module do to logits?
Scales the values between 0 and 1, representing the model's predicted probabilities for each class.
145
In softmax = nn.Softmax(dim=1), what does dim=1 mean?
dim=1 is the dimension along which the values must sum to 1.
146
What do we need to pass to the NLLLoss function?
We need to provide log probabilities. This can be done more efficiently with the LogSoftmax module
147
What is an alternative to using the NLLLoss function and the LogSoftmax module?
A better approach is to take the model output logits and use the CrossEntropyLoss() loss_fcn = nn.CrossEntropyLoss() loss_fcn(logits,labels)
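A quick numerical check that the two routes agree, using random logits and made-up labels:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw model outputs for 4 samples
labels = torch.tensor([1, 0, 9, 3])    # arbitrary target classes

# Route 1: CrossEntropyLoss applied directly to the logits.
ce = nn.CrossEntropyLoss()(logits, labels)

# Route 2: LogSoftmax followed by NLLLoss.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))  # True
```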
148
What does parameterised mean?
There are associated weights and biases that are optimised during training.
149
What does subclassing nn.Module allow for parameters?
It automatically tracks all fields defined inside the model object, and makes all parameters accessible using the model's parameters() or named_parameters() methods.
150
How do you save a model?
A common way is to serialise the internal state dictionary (containing the model). torch.save(model.state_dict(), "model.pth")
151
How do you load a model?
It includes re-creating the model structure and loading the state dictionary into it. model = NeuralNetwork() model.load_state_dict(torch.load("model.pth",weights_only=False))
152
Why are images less ideal for fully connected layers?
The images were turned into 1D tensors which caused the loss of the relative position of pixels. We may expect neighbouring pixels to be related to each other more than ones that are far apart.
153
Describe a 2D convolutional neural layer.
A set of kernels are convolved with the input to produce the output.
154
What do convolutional neural networks allow for images?
They are natural processes for revealing interesting features in images. This means that they only extract features that are likely to be important for a machine learning application.
155
In a convolutional layer, what does the learning?
The kernels themselves learn during the training process. ie the values of each element of the kernels are individual parameters to be trained.
156
What parameters does the convolutional layer have?
- Padding - Stride These affect how the kernel is used and the shape of the output.
157
What are other important layers for convolutional neural networks, in addition to the convolutional layers?
- Max pooling - Flatten
158
What does max pooling achieve?
Max pooling effectively shrinks down an image by taking the maximum value over a specified range of pixels. Alternative - average pooling.
159
What does flatten achieve?
Flatten reduces the dimensionality, in a similar way to reshaping. This is required to connect a convolutional layer to a fully connected 1D layer, such as the output layer of the network.
160
How do you define a convolutional network?
class CNN(nn.Module):
161
When we have the first convolutional layer, how many input channels do we need for images?
1 if dealing with monochrome images 3 if dealing with RGB images
162
What is the default for Conv2d in terms of padding and stride?
No padding Stride = 1
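A sketch of how padding, stride and max pooling change the output shape for a batch of monochrome 28x28 images. The kernel size of 3 and the 8 output channels are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

x = torch.rand(4, 1, 28, 28)   # 4 random stand-in images, 1 channel

conv_default = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)  # no padding, stride 1
conv_padded = nn.Conv2d(1, 8, kernel_size=3, padding=1)
conv_strided = nn.Conv2d(1, 8, kernel_size=3, stride=2)
pool = nn.MaxPool2d(kernel_size=2)

print(conv_default(x).shape)        # torch.Size([4, 8, 26, 26])
print(conv_padded(x).shape)         # torch.Size([4, 8, 28, 28])
print(conv_strided(x).shape)        # torch.Size([4, 8, 13, 13])
print(pool(conv_default(x)).shape)  # torch.Size([4, 8, 13, 13])
```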
163
What two additional layers do convolutional networks include?
nn.Conv2d - this is used for the first layer, we have to respect the dimensions of the data being input. This layer requires we have a channel - for monochrome images we have 1 channel. nn.MaxPool2d
164
How do you define the loss function for a classification problem?
loss_fn = nn.CrossEntropyLoss() loss_fn(outputs,labels)
165
What does the nn.CrossEntropyLoss() do?
It takes logits as the input and performs log_softmax and then NLLLoss()
166
What does "functional version" of softmax refer to?
It is a function that directly computes the softmax. The object doesn't need to be instantiated and it can be used in line.
167
How do we import the functional version of softmax?
import torch.nn.functional as F
168
What do we pass into the training function?
def train(model, loss_fn, optimizer): - Passing in the model, loss function and optimiser
169
What is a difference between enumerate and iterate?
Enumerate has a built in way to keep track of which batch we are on, we would have to do this manually for iterate.
170
Each time we iterate through a training batch, what do we do?
- Compute the prediction error (get the predictions and pass to loss function) - Zero the gradients, carry out backwards propagation, use the optimiser to update the weights - We can print out how it is going eg plot the loss
171
What is a key difference between the training and test functions?
In the test function, we do not carry out back propagation - it is set to evaluation mode - we turn off the gradient calculation using with torch.no_grad(): The function only takes in the model and loss function (no optimiser) def test(model, loss_fn):
172
How can we see how many parameters are used in a model?
n_param = 0
for parameter in model.parameters():
    n_param += np.prod(parameter.shape)
n_param
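For a concrete number, here is the same count on a single linear layer used as a stand-in model (784 inputs, 10 outputs):

```python
import numpy as np
import torch.nn as nn

model = nn.Linear(784, 10)   # stand-in model: weights 784*10, biases 10

n_param = 0
for parameter in model.parameters():
    n_param += int(np.prod(parameter.shape))

print(n_param)  # 7850
```

The same total can be obtained with sum(p.numel() for p in model.parameters()).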
173
How do the number of parameters in a CNN compare to a fully connected model?
Convolutional layers have fewer parameters than a fully connected neural network. This is due to weight sharing. We can get far better performance for images (or natural data) where there is correlation of particular elements of data.
174
When are we more likely to experience overfitting?
When training complex models (large number of free parameters) with a limited data set.
175
What does overfitting mean?
The model will be able to fit the training data very well (potentially perfectly), but in the process learns a model that is very specific to the training data. When we are asked to make predictions on unseen data (eg testing dataset) the accuracy will be poor as the model is not generalisable. At the extreme end, the model may just be effectively "memorising" the training data, and may not have any predictive power at all.
176
How can we detect overfitting?
Comparing the training data loss with the testing data loss. If the training data loss continues to improve with every training loop, but the testing loss does not improve (or gets worse) then overfitting may be responsible.
177
What are some strategies to prevent overfitting?
- Use a larger training set - Use a smaller network - Weight sharing (as in CNNs) - Using dropout layers - Data normalisation - Data augmentation - Early stopping - Transfer learning - Model averaging - Weight decay - Batch normalisation
178
Why is data normalisation a strategy for preventing overfitting?
Neural networks generally work best with values that follow a Normal distribution ie have an average of zero and a standard deviation of one.
179
What is data augmentation?
Applying a transformation to the training data to improve its variability - artificially increasing the training dataset size
180
What does early stopping achieve?
It stops training once the test (validation) loss stagnates, rather than continuing to train and starting to overfit.
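One possible way to express early stopping, as a sketch in plain Python. The function name early_stopping_epoch, the patience parameter and the example losses are assumptions made for illustration; the original notes do not specify a stopping rule.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stalls after epoch 2
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
print(early_stopping_epoch(losses, patience=3))  # 5
```

In a real training loop the same check would run once per epoch, and the weights from the best epoch would usually be kept.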
181
What is transfer learning?
Using a pre-trained model for part of your neural network and then just re-training the last few layers with your data.
182
How do we achieve batch normalisation?
Adding BatchNorm1d or BatchNorm2d layers, which normalise the inputs to the next layer over each batch so that they have roughly zero mean and unit standard deviation.
183
Discuss how effective it is to collect a larger training set.
It may be impractical or expensive in practice.
184
Discuss how effective it is to use a smaller network.
It means that we need to restart training from scratch, rather than building on what we already know about suitable hyperparameters and weights.
185
Discuss how effective it is to carry out transfer learning.
We use the pre-trained weights of a different model as part of the neural network. The architecture and weights of this model were trained on a larger dataset to solve a different image classification problem. Nevertheless, transfer learning allows us to leverage information from larger datasets at low computational cost. You would typically just train the last few layers with the training dataset while keeping the rest of the weights fixed.
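A sketch of "train the last few layers while keeping the rest fixed", assuming PyTorch. The backbone here is a small untrained stand-in (hypothetical); in practice it would be a network whose weights were loaded from a model trained on a larger dataset.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained feature extractor
backbone = nn.Sequential(
    nn.Conv2d(1, 8, 3), nn.ReLU(), nn.MaxPool2d(2), nn.Flatten()
)

# Freeze the pre-trained part so its weights stay fixed during training
for p in backbone.parameters():
    p.requires_grad = False

# New final layer trained on our own data
# (28x28 input -> 26x26 after the conv -> 13x13 after pooling, with 8 channels)
head = nn.Linear(8 * 13 * 13, 10)
model = nn.Sequential(backbone, head)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['1.weight', '1.bias']
print(model(torch.zeros(2, 1, 28, 28)).shape)  # torch.Size([2, 10])
```

Only the parameters with requires_grad=True are updated by the optimiser.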
186
How do we refer to a dataset used to check for overfitting?
It is no longer a pure test dataset, instead it is called a validation dataset.
187
What is a difference between convolutional neural networks and the fully connected layers we saw previously?
A CNN does not require the image data to be reshaped (eg by flattening) before the first layer. When presented with image data, we can pass it straight into the CNN. In real life you may need more preprocessing, but not for the images we will see.
188
How do we define the training function for CNNs?
def train(model, train_loader, test_loader, batch_size=20, num_epochs=1, learn_rate=0.001, weight_decay=0):
189
What does nn.Conv2d() take as arguments?
Three compulsory arguments - Number of channels in: greyscale images = 1 - Number of channels out: this is how many kernels (filters) the layer will learn - Kernel size: eg a kernel size of 3 means a 3x3 kernel
190
What does nn.Flatten() achieve?
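It flattens the multi-dimensional feature maps from the convolutional layers into a single vector per image, so that they can be passed to the fully connected layers, which end in 10 outputs (one per class).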
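A short sketch of how nn.Conv2d and nn.Flatten change tensor shapes, assuming a batch of 20 greyscale 28x28 images (the batch size and image size are illustrative assumptions).

```python
import torch
import torch.nn as nn

# 1 input channel (greyscale), 16 output channels (kernels), 3x3 kernel
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

images = torch.zeros(20, 1, 28, 28)  # [batch, channels, height, width]
features = conv(images)
print(features.shape)  # torch.Size([20, 16, 26, 26])

# Flatten keeps the batch dimension and merges the rest: 16 * 26 * 26 = 10816
flat = nn.Flatten()(features)
print(flat.shape)  # torch.Size([20, 10816])
```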
191
In which part of the code do we account for overfitting?
The training block
192
How do we plot the learning curves?
In the train function, we can plot at the end of the training process. Need to keep track of epochs, losses, val_losses and val_acc throughout
193
What other function can we create?
def get_accuracy(model, test_loader, criterion): It must be in evaluation mode.
194
What is the code for the get_accuracy function?
def get_accuracy(model, test_loader, criterion):
    correct = 0
    total = 0
    loss = 0
    model.eval()  #*********#
    for imgs, labels in test_loader:
        output = model(imgs)
        loss += criterion(output, labels).item()
        pred = output.max(1, keepdim=True)[1]  # get the index of the max logit
        correct += pred.eq(labels.view_as(pred)).sum().item()
        total += imgs.shape[0]
    return loss/total, correct / total
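A possible variant of get_accuracy, offered as a sketch and not as the course's version. Two changes are my own assumptions: the loop runs inside torch.no_grad() so no gradients are tracked, and each batch loss is weighted by the batch size so that loss / total is a per-sample average. The toy data and the small linear model are made up for the demonstration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def get_accuracy(model, test_loader, criterion):
    correct, total, loss = 0, 0, 0.0
    model.eval()              # evaluation mode (affects dropout and batch norm)
    with torch.no_grad():     # gradients are not needed for evaluation
        for imgs, labels in test_loader:
            output = model(imgs)
            # criterion returns the batch mean, so weight it by the batch size
            loss += criterion(output, labels).item() * imgs.shape[0]
            pred = output.argmax(dim=1)  # index of the max logit
            correct += (pred == labels).sum().item()
            total += imgs.shape[0]
    return loss / total, correct / total


# Toy check with random data and an untrained model
data = torch.randn(40, 1, 28, 28)
labels = torch.randint(0, 10, (40,))
loader = DataLoader(TensorDataset(data, labels), batch_size=20)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
val_loss, val_acc = get_accuracy(model, loader, nn.CrossEntropyLoss())
print(val_loss, val_acc)
```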
195
What plots can you make?
- Plotting loss over epoch number - Plotting accuracy over epoch number
196
What is data normalisation?
Scaling the input features of a neural network, so that all features are scaled similarly (similar means and standard deviations). This makes the training problem easier. eg scaling so that there is a mean of 0 and standard deviation of 1, or scaling so that values are in the range [0, 1]
197
How do we normalise the data?
train_mean = train_data.mean()
train_std = train_data.std()
norm = transforms.Normalize(train_mean, train_std)
train_data_norm = norm(train_data)
test_data_norm = norm(test_data)
This transform subtracts the mean value from each pixel, and divides the result by the standard deviation. We then pass this data to TensorDataset, create DataLoaders and pass these parameters to an instantiation of the model.
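The same subtract-the-mean, divide-by-the-standard-deviation step can be sketched with plain tensor arithmetic, assuming random stand-in data in place of the real images. The test data deliberately reuses the training statistics.

```python
import torch

torch.manual_seed(0)
train_data = torch.rand(100, 1, 28, 28)  # stand-in for the real training images
test_data = torch.rand(20, 1, 28, 28)    # stand-in for the real test images

train_mean = train_data.mean()
train_std = train_data.std()

train_data_norm = (train_data - train_mean) / train_std
test_data_norm = (test_data - train_mean) / train_std  # reuse training statistics

# After normalisation the training data has mean ~0 and standard deviation ~1
print(train_data_norm.mean().item(), train_data_norm.std().item())
```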
198
Why is data augmentation useful?
While it is often expensive to gather more data, we can often programmatically "generate" more data points from our existing data set.
199
What are common ways of obtaining new (image) data?
- Flipping each image horizontally or vertically (won't work for digit recognition but may for other tasks) - Shifting each pixel a little to the left or right - Rotating the images - Scaling images up or down - Adding noise to the image - Can have a combination of these approaches
200
Programmatically, how could we apply rotations/translations/scaling to the training images?
transform=transforms.RandomAffine(XXX) train_data_trans = transform(train_data) Then create another tensor dataset. train_dataset = TensorDataset(train_data_trans,train_labels) eg rotate by up to 25 degrees, translations of up to 5% of the image size, scaling from 80% to 110% of the original size: transform=transforms.RandomAffine(25, translate=(0.05,0.05), scale=(0.8,1.1))
201
Discuss weight decay.
Weight decay is a technique that prevents overfitting. It penalises large weights. We want to avoid large weights, because large weights mean that the prediction relies a lot on the content of one pixel, or on one unit. Intuitively, it does not make sense that classification should rely heavily on one, or a few pixels. Mathematically, we penalise large weights by adding an extra term to the loss function.
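As a sketch of the extra term, assuming the common L2 form of the penalty: the symbols below (data loss L_data, weights w_i, penalty strength lambda) are notation introduced here for illustration and do not appear in the original notes.

```latex
L_{\text{total}}(w) = L_{\text{data}}(w) + \frac{\lambda}{2} \sum_i w_i^2,
\qquad
\frac{\partial L_{\text{total}}}{\partial w_i}
  = \frac{\partial L_{\text{data}}}{\partial w_i} + \lambda w_i
```

The gradient of the penalty is proportional to the weight itself, so every update pulls each weight towards zero by an amount that grows with its size.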
202
How is weight decay achieved in PyTorch?
Weight decay can be done automatically inside an optimiser. The weight_decay parameter of optim.Adam and most other optimisers applies L^2 regularisation. The value of weight_decay is another tuneable hyperparameter. train(model, train_loader_aug, test_loader, num_epochs=50, weight_decay=1e-3)
203
Discuss the dropout method.
Another way to prevent overfitting is to build many models, then average their predictions at test time. Each model might have a different set of initial weights. Dropout can be seen as a cheap approximation of this: it randomly zeros out a portion of the neurons in each training iteration. This has the effect of preventing weights from being overly dependent on each other. Weights are encouraged to be "more independent" of one another. We only drop out neurons during training; at test time we use the entire set of weights. This means that the training and test behaviour of dropout layers is different.
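The different training and test behaviour can be seen directly on a single nn.Dropout layer. This is a small sketch with an arbitrary seed and input size; the exact fraction of zeros varies from run to run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

drop.train()
y_train = drop(x)  # roughly half the entries are zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)   # in evaluation mode the layer passes the input through unchanged

print((y_train == 0).float().mean().item())  # fraction dropped, close to 0.5
print(torch.equal(y_eval, x))                # True
```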
204
How do we incorporate dropout into our model?
We add a nn.Dropout2d(X) layer to our nn.Sequential eg nn.Conv2d(1, 16, 3), nn.MaxPool2d(2), nn.Dropout2d(0.5), nn.ReLU()
205
How do we incorporate normalisation into our model?
We add a nn.BatchNorm2d(X) layer to our nn.Sequential eg nn.Conv2d(1, 16, 3), nn.BatchNorm2d(16), nn.MaxPool2d(2), nn.Dropout2d(0.5), nn.ReLU()
206
What is the task of sentiment analysis?
To identify the sentiment of a particular bit of text. Eg deciding if an app review is positive or negative from the written words. The machine learning model has to learn something about language, and the meaning of particular sentences.
207
Discuss the differences between text data and numerical data.
- Text is made of characters and strings, whereas neural networks deal with numbers and matrix operations - Text can be different lengths, whereas the data before were all composed of equally sized 1D or 2D arrays of numbers
208
How do we deal with text data?
We need to convert the text into numbers. This is typically done in two stages. - The text is broken up into either individual characters or individual words and symbols - Each possible character or word is assigned a number (basically a lookup table or substitution code) In this way, a sequence of characters or words can be converted to numbers where each number represents a particular possibility.
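The two stages can be sketched in plain Python. The functions simple_tokenize and build_lookup are hypothetical helpers written for this illustration; they are not the torchtext tokenizer or vocabulary used in the later cards, although the idea is the same.

```python
import re


def simple_tokenize(text):
    """Stage 1: lowercase the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_lookup(texts):
    """Stage 2: assign each distinct token a number; 0 is kept for unknowns."""
    lookup = {"<unk>": 0}
    for text in texts:
        for tok in simple_tokenize(text):
            lookup.setdefault(tok, len(lookup))
    return lookup


lookup = build_lookup(["I ate my sandwich at my desk!"])
tokens = simple_tokenize("I ate my lunch!")
print(tokens)                               # ['i', 'ate', 'my', 'lunch', '!']
print([lookup.get(t, 0) for t in tokens])   # [1, 2, 3, 0, 7]
```

The word "lunch" was never seen when the lookup table was built, so it maps to 0.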
209
What piece of code do we need to turn text into separate words and symbols?
tokenizer = get_tokenizer('basic_english')
210
What does this piece of code do - tokenizer('I ate my sandwich at my desk!')?
Separates out the sentence into an array of words and punctuation. It also makes everything lowercase.
211
What piece of code do we use to ensure randomisation doesn't change the outputs?
torch.manual_seed(99) np.random.seed(99)
212
How do you split text dataframe data into a training and test set?
_, tweet_sub_df = train_test_split(tweet_df, test_size=0.1) splits tweet_df into 90% and 10%. _ (underscore) is used to ignore the first returned value (the 90%), so tweet_sub_df stores 10% of the dataset. train_df, test_df = train_test_split(tweet_sub_df, test_size=0.1) then splits this subset. Double split - The first split selects a random subset (10%) of the full dataset. - The second split divides this subset into train (90%) and test (10%).
213
How do we build a vocabulary up?
By passing all of the tokens from these tweets to build_vocab_from_iterator:
from torchtext.vocab import build_vocab_from_iterator
def yield_tokens(df):
    for n, row in df.iterrows():
        yield tokenizer(row[1])
vocab = build_vocab_from_iterator(yield_tokens(train_df), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
214
What do the references to <unk> tell the vocabulary builder?
To map all unknown tokens to index zero, ie tokens not in the vocabulary will be given a value of zero by the vocab object.
215
What can we do with the vocab object once created?
vocab(tokenizer('I ate my sandwich at my desk!')) returns the corresponding numbers of the tokens
216
How do we investigate the corresponding number of a token and vice versa?
vocab.get_stoi()['yes'] vocab.get_itos()[0]
217
What kind of vector do we build once we have got the corresponding numbers of our tokens?
The standard approach is then to build a vector for each token or piece of text where the length of the vector is equal to the number of possible tokens. For a given token, all elements of the vector are zero except at the index which corresponds to the position of that token in the vocabulary.
def make_vectors(text):
    indexes = vocab(tokenizer(text))
    vectors = torch.zeros(len(vocab), len(indexes))
    for n, ind in enumerate(indexes):
        vectors[ind, n] = 1
    return vectors
text_vectors = make_vectors(text)
Can then investigate .shape and .argmax(0)
218
How can we combine vectors of individual tokens?
If we sum them, we get a single vector for each piece of text that counts how many times each token appears. This text data can be passed to a neural network model.
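A small worked example of the one-hot vectors and their sum, assuming a toy vocabulary of 6 tokens and a made-up list of token numbers in place of vocab(tokenizer(text)).

```python
import torch

vocab_size = 6
indexes = [2, 4, 2, 5]  # made-up token numbers for one short text

# One column per token in the text, one row per vocabulary entry
vectors = torch.zeros(vocab_size, len(indexes))
for n, ind in enumerate(indexes):
    vectors[ind, n] = 1

print(vectors.argmax(0))  # tensor([2, 4, 2, 5]) recovers the token numbers

# Summing over the token axis counts how often each token appears
counts = vectors.sum(1)
print(counts)             # tensor([0., 0., 2., 0., 1., 1.])
```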
219
What is the size of the input layer of a model for text data?
It would need to be large - equal to the total number of possible tokens.
219
What would the input of the neural network be for text data?
It would be a long vector - which is mostly zeros but has the count of each possible token in a given piece of text.
220
How do we create a function and class to turn a text dataset into numerical vectors?
- Define a function text_2_vec - Define a class CustomTextDataset
def text_2_vec(text):
    return make_vectors(text).sum(1)
class CustomTextDataset(Dataset):
    def __init__(self, labels, text):
        self.labels = labels
        self.text = text
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        label = self.labels[idx]
        text = self.text[idx]
        vec = text_2_vec(text)
        return label, vec
221
How do we prepare text data for machine learning models by converting it into a numerical format, using text_2_vec(t)?
train_vectors = torch.Tensor(len(train_texts), len(vocab))
for n, t in enumerate(train_texts):
    train_vectors[n,:] = text_2_vec(t)
This code creates an empty (uninitialised) tensor to hold the vectorised text. Each text is then converted into a numerical vector and stored in the corresponding row of train_vectors.
222
How do you create a training and testing dataset based on the numerical text values?
train_dataset = CustomTextDataset(train_labels,train_texts) test_dataset = CustomTextDataset(test_labels,test_texts) test_dataset[0][1].shape Using the previously defined CustomTextDataset class
223
What is the first layer of the neural network model, which takes the text vector as an input, referred to as?
Embedding There are better and quicker ways of doing the embedding eg nn.Embedding
224
What is a common way to do the embedding for language models?
Use a pre-trained embedding layer. GloVe - global vectors for word representation. GloVe embedding is an example of unsupervised learning, where the vector representation for words is learnt from a body of text by looking at the co-location of different words. The GloVe model learns which words are closely related and which are not. This enables the algorithm to place words in a multidimensional representation, so that similar words are close together and different words are far apart.
225
What kind of GloVe layer will we use?
A layer that was pre-trained on millions of documents and has learned an efficient embedding for English text.
226
How do we obtain a pre-trained GloVe model?
glove = torch.load('glove6B_20000.pth') n_dim = glove.dim The dimension specifies how many dimensions are used to encode the tokens and max_vectors is how many tokens to include.
227
How do we get a 50-long vector representation of the token "yes"?
glove.get_vecs_by_tokens('yes')
228
What is returned if we pass a word not in the pre-trained vocabulary?
It will return the zero-vector
229
The glove object contains a list of all the trained words. How do we see if the glove object contains a word?
'hello' in glove.stoi
230
Why is the choice of tokeniser important?
Eg in some tokenisers, capitalisation is lost, which may be important
231
How does our neural network model for Glove differ?
The number of inputs is n_dim, which is glove.dim. There are two outputs - we were looking at whether tweets were positive or negative.
232
How does the model using glove embedding compare?
It is quicker to run, however the training loss, test loss and accuracy are a bit worse. This could be improved by using a larger GloVe model, eg we only use 20,000 words and 50 embedded dimensions. Additionally, the model using pre-trained GloVe embedding has far fewer trainable weights than the previous model.
233
There is a pytorch layer that can handle the embedding for you. It uses the glove model to turn a tensor of word indexes into embedded vectors, what is the code for this?
train_texts = train_df[1].values print(train_texts[3]) emb = nn.Embedding.from_pretrained(glove.vectors) inds = torch.tensor([vocab[t] for t in tokenizer(train_texts[3])]).reshape(1,-1,) print(inds) emb(inds).shape
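The behaviour of nn.Embedding.from_pretrained can be checked on a tiny example. The 3x3 table toy_vectors below is a made-up stand-in for glove.vectors, so the numbers carry no meaning.

```python
import torch
import torch.nn as nn

# Made-up embedding table: 3 tokens, each encoded with 3 dimensions
toy_vectors = torch.tensor([[0.0, 0.0, 0.0],
                            [1.0, 2.0, 3.0],
                            [4.0, 5.0, 6.0]])

emb = nn.Embedding.from_pretrained(toy_vectors)  # weights are frozen by default

inds = torch.tensor([[1, 2, 1, 0]])  # [batch, sequence length] of token indexes
out = emb(inds)
print(out.shape)   # torch.Size([1, 4, 3])
print(out[0, 1])   # tensor([4., 5., 6.]), the row for token index 2
```

Each index is simply replaced by the matching row of the table, which adds one dimension to the tensor.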
234
What is the issue with reducing each tweet to a single vector?
It reduces the amount of information that was available to train the network. It would be better to pass each word in a sentence and to maintain the order of the words so that the model may learn to interpret the meaning of the texts. Recurrent neural networks (RNNs) are well suited to this task due to their ability to maintain a state which remembers the context of new words.
235
Why would we restrict the number of training articles?
To reduce the training time.
236
One issue with text sentiment analysis is that you may have pieces of text that are different lengths. How do we deal with this?
There is no consistent size that we could choose for the tensors. The simplistic ways involve padding and truncating tensors so that they are all the same length.
237
What number do we pad the ends with?
-1. The padding makes sure we can put the indexes from many texts into a single tensor. -1 was chosen as it is easy to identify which indices are from the padding later (none of the natural words are assigned -1).
238
What do we do following padding?
Truncate down to the desired length
239
Why might you need to offset the labels by 1?
eg np.unique(train_df[0].values) shows that the labels are the numbers 1-4; we may want to subtract 1 from these so that they are 0-indexed. Classification losses such as nn.CrossEntropyLoss expect class labels running from 0 to num_classes-1.
240
Why can we not pass padded data straight to the embedder?
The embedding layer cannot look up negative indexes, so the -1s cause an error. We need to use .clip, eg data.clip(min=0)[0]. This means anything below 0 gets clipped to 0.
241
For text data, how do we pass data to the embedder?
embed_batch = nn.Embedding.from_pretrained(glove.vectors) embed_batch(data.clip(min=0)).shape
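The padding, truncating and clipping steps from the last few cards can be sketched together. The helper pad_and_truncate and the token numbers are made up for illustration, and a randomly initialised nn.Embedding stands in for the GloVe layer.

```python
import torch
import torch.nn as nn


def pad_and_truncate(index_lists, length, pad_value=-1):
    """Put variable-length index lists into one [n_texts, length] tensor."""
    data = torch.full((len(index_lists), length), pad_value, dtype=torch.long)
    for row, idx in enumerate(index_lists):
        idx = idx[:length]  # truncate texts that are too long
        data[row, :len(idx)] = torch.tensor(idx, dtype=torch.long)
    return data


texts = [[5, 3, 8], [2, 7, 1, 4, 9, 6]]  # made-up token indexes
data = pad_and_truncate(texts, length=4)
print(data)
# tensor([[ 5,  3,  8, -1],
#         [ 2,  7,  1,  4]])

emb = nn.Embedding(10, 3)            # random stand-in for the GloVe embedding
embedded = emb(data.clip(min=0))     # clip the -1 padding to a valid index
print(embedded.shape)                # torch.Size([2, 4, 3])
```

The original data tensor still holds the -1s, so the padded positions can be identified later.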
242
What do RNN modules contain?
A hidden state that is read and updated by the update function as each element of the sequence is passed through.
243
Describe the code for the RNN layer.
rnn_layer = nn.RNN(input_size=n_dim, hidden_size=50, batch_first=True)
244
What is the input format of the RNN layer and when?
[batch_size, seq_len, repr_dim] When batch_first is true
245
If you don't specify the initial hidden state, h0, what is assumed?
It is 0s. h0 = torch.zeros(1,text_emb.shape[0],hidden_size)
246
What are the outputs of RNN?
out - the full history of the hidden state. last_hidden - the hidden state after all elements of the sequence have been passed through; we can take this and pass it to the fully connected layers to do the classification task. out, last_hidden = rnn_layer(text_emb) We could also pass in h0.
247
What does the output variable contain?
The concatenation of all of the output units for each word (ie at each time point).
248
Which part of the output variable are we concerned about?
We only care about the output at the final time point. We can extract it like: out[0,-1,:] For the hidden state, the dimensions are in a different order ([layers, batch, hidden]), so the batch index goes second instead: last_hidden[:,0,:]
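A sketch of the RNN outputs and their shapes, assuming a single-layer nn.RNN and random stand-in embeddings (the sizes 4, 12 and 8 are arbitrary). For unpadded sequences the last time point of out matches last_hidden.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_dim, hidden_size = 8, 50
rnn_layer = nn.RNN(input_size=n_dim, hidden_size=hidden_size, batch_first=True)

text_emb = torch.randn(4, 12, n_dim)  # [batch_size, seq_len, repr_dim]
out, last_hidden = rnn_layer(text_emb)

print(out.shape)          # torch.Size([4, 12, 50]) hidden state at every time point
print(last_hidden.shape)  # torch.Size([1, 4, 50])  [layers, batch, hidden]

# The final time point of out is the same as the last hidden state
print(torch.allclose(out[:, -1, :], last_hidden[0]))  # True
```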
249
What kind of data do we want to look at?
Meaningful data, ie tokens whose index went in as 0 or above (the padding is -1). (torch.arange(data.shape[1])*(data[0]>=0)).argmax() The argmax gives us the index of the last non-padding token, ie the time point whose output we want.
250
How do we define a model for text data using RNN?
- self.emb = nn.Embedding - self.rnn = nn.RNN - self.fc = nn.Sequential
class TextRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(glove.vectors)
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 50),
            nn.Dropout(0.2),
            nn.Linear(50, num_classes)
        )
251
What do we include in the forward function defined in the model?
- Finding the index of the last non-padding input - Apply the embedding - Forward propagate through the RNN - Get the last valid output - Propagate through the fc layers to the output
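The five steps could be put together roughly as follows. This is a sketch, not the course's implementation: the constructor takes the embedding table as an argument (an assumption made so the example is self-contained), and random vectors stand in for glove.vectors.

```python
import torch
import torch.nn as nn


class TextRNN(nn.Module):
    def __init__(self, vectors, hidden_size, num_classes):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(vectors)
        self.rnn = nn.RNN(vectors.shape[1], hidden_size, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_size, 50), nn.Dropout(0.2), nn.Linear(50, num_classes)
        )

    def forward(self, data):
        # 1. Index of the last non-padding token in each row (padding is -1)
        positions = torch.arange(data.shape[1], device=data.device)
        last = (positions * (data >= 0)).argmax(dim=1)
        # 2. Apply the embedding, clipping the padding to a valid index
        x = self.emb(data.clip(min=0))
        # 3. Forward propagate through the RNN
        out, _ = self.rnn(x)
        # 4. Pick the output at the last valid time point of each text
        rows = torch.arange(data.shape[0], device=data.device)
        final = out[rows, last, :]
        # 5. Propagate through the fully connected layers to the class scores
        return self.fc(final)


vectors = torch.randn(100, 50)  # random stand-in for glove.vectors
model = TextRNN(vectors, hidden_size=32, num_classes=4)
data = torch.tensor([[5, 9, 2, -1, -1], [7, 1, 3, 8, 4]])
scores = model(data)
print(scores.shape)  # torch.Size([2, 4])
```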
252
What are more powerful versions of RNN?
- Long short-term memory (LSTM) - Gated-recurrent unit (GRU) They both aim to overcome the vanishing gradients problem
253
What is the layer and implementation for LSTM?
lstm_layer = nn.LSTM(input_size=n_dim, hidden_size=50, batch_first=True)
254
What is a difference with LSTM?
LSTM keeps track of both a hidden state and a cell state, so it has an extra state to initialise, and the second return value is a tuple of the final hidden and cell states. h0 = torch.zeros(1, text_emb.shape[0], 50) c0 = torch.zeros(1, text_emb.shape[0], 50) out, (last_hidden, last_cell) = lstm_layer(text_emb, (h0, c0))
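A short sketch of the LSTM call with explicit initial states, assuming random stand-in embeddings. The names last_hidden and last_cell are illustrative labels for the two tensors nn.LSTM returns in its second output.

```python
import torch
import torch.nn as nn

n_dim = 8
lstm_layer = nn.LSTM(input_size=n_dim, hidden_size=50, batch_first=True)
text_emb = torch.randn(4, 12, n_dim)  # [batch_size, seq_len, repr_dim]

# Both states have shape [layers, batch, hidden]
h0 = torch.zeros(1, text_emb.shape[0], 50)
c0 = torch.zeros(1, text_emb.shape[0], 50)

# The second return value is a tuple: (final hidden state, final cell state)
out, (last_hidden, last_cell) = lstm_layer(text_emb, (h0, c0))
print(out.shape, last_hidden.shape, last_cell.shape)
```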
255
What are autoencoders?
Adapted forms of neural networks which are generally tasked with reproducing the input - The input and output are the same for a perfect autoencoder
256
What is a characteristic of the autoencoder's architecture?
There is a hidden layer with fewer dimensions than the input, creating an information bottleneck.
257
What are the two parts of the autoencoder?
The encoder and the decoder, with the code section in between (latent representation).
258
Why can the autoencoder be considered a lossy compression algorithm?
You could use the trained encoder to reduce some data to a smaller representation, and send this somewhere else where the trained decoder is used to retrieve the original data. It is lossy because in practice the reconstruction is not perfect.
259
What is the difference for programming an autoencoder in pytorch compared to other models we have seen?
It is similar but we need to create an information bottleneck and train it using the same data as the input and output.
260
What error function do we use for the autoencoder?
The loss function is required to measure the fidelity of the reconstruction and so we will use the mean squared error of the input and output tensors.
261
What does the ConvAutoencoder class contain?
A self.encoder and self.decoder part. self.encoder - a network to connect the batch of images to the latent space. The final layer outputs latent_dim, which will be the size of the bottleneck. self.decoder - a network to connect the latent space to the image reconstructions.
262
What is different about the loss calculation for the autoencoder?
Now we don't have labels (as we did in classification) - we compare the reconstructions. For this, we use the MSE loss function. nn.MSELoss()
263
What kinds of layers are good to use for image data?
Convolutional layers
264
Name any differences in the encoder and decoder networks?
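self.encoder has an nn.Flatten() layer. self.decoder has an nn.Unflatten() layer - this takes inputs (the dimension to unflatten and the target shape). Use a square number of units in the preceding linear layer so that its output can be reshaped into a square (rectangular) feature map here.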
265
What is nn.UpsamplingBilinear2d?
An optional layer - kind of like an inverse of max pool which was used to reduce the data
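One way the pieces from the last few cards could fit together, as a sketch only. The layer sizes, latent_dim=16 and the final Sigmoid are illustrative assumptions and not the course's ConvAutoencoder; random images stand in for the real data.

```python
import torch
import torch.nn as nn


class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),     # [B, 8, 28, 28]
            nn.MaxPool2d(2),                              # [B, 8, 14, 14]
            nn.Flatten(),                                 # [B, 8*14*14]
            nn.Linear(8 * 14 * 14, latent_dim),           # the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 14 * 14), nn.ReLU(),    # 196 = 14^2, a square number
            nn.Unflatten(1, (1, 14, 14)),                 # [B, 1, 14, 14]
            nn.UpsamplingBilinear2d(scale_factor=2),      # [B, 1, 28, 28]
            nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid(),  # pixel values in (0, 1)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))


model = ConvAutoencoder(latent_dim=16)
imgs = torch.rand(20, 1, 28, 28)   # random stand-in for a batch of images
recon = model(imgs)
loss = nn.MSELoss()(recon, imgs)   # the input is also the target; no labels needed
print(recon.shape, loss.item())
```

The upsampling layer undoes the halving done by the max pool, so the reconstruction has the same shape as the input and the MSE loss can compare them pixel by pixel.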
266
How does the training differ for an autoencoder?