test exam Flashcards
(15 cards)
What shape is this array?
[[4 1 0 9 0]
 [1 1 2 9 7]]
None
(2)
(10)
(2, 5)
(5, 2)
(2, 5)
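For reference, a quick way to verify this is to build the same array in NumPy (which the course uses) and inspect its .shape attribute:

import numpy as np

# The same array written out in NumPy.
a = np.array([[4, 1, 0, 9, 0],
              [1, 1, 2, 9, 7]])

print(a.shape)  # (2, 5): two rows, five columns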
What is the name of this activation function?
f(x) = x  if x > 0
f(x) = 0  if x <= 0
Hyperbolic tangent (tanh)
Sigmoid (logistic)
Rectified linear unit (ReLU)
Heaviside (step)
Rectified linear unit (ReLU)
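As a quick illustration (not part of the original card), ReLU can be written in one line of NumPy:

import numpy as np

def relu(x):
    # Positive values pass through unchanged, everything else is clamped to zero.
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]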
What is the point of using activation functions?
Allow the network to learn nonlinear relationships
Eliminate the need for bias parameters
Speed up computation
Regularise the network
Allow the network to learn nonlinear relationships
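A small NumPy sketch of why this matters: without activation functions, stacked layers collapse into a single linear map (the matrices here are arbitrary examples):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # an arbitrary input vector
W1 = rng.normal(size=(3, 4))      # "layer 1" weights
W2 = rng.normal(size=(2, 3))      # "layer 2" weights

two_layers = W2 @ (W1 @ x)        # two stacked layers with no activation
one_layer = (W2 @ W1) @ x         # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: no extra expressive power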
We want to design a neural network to perform classification on data containing 10 distinct, non-overlapping classes. There are in total 10 000 data points, with 32 features. What should
the output layer of our network look like?
keras.layers.Dense(units=10, activation="relu")
keras.layers.Dense(units=32, activation=None)
keras.layers.Dense(units=10, activation="softmax")
keras.layers.Dense(units=32, activation="softmax")
keras.layers.Dense(units=10000, activation="tanh")
keras.layers.Dense(units=10, activation="softmax")
10 classes mean we need 10 output nodes. Softmax activation will give us proper output values that lie in the interval [0, 1] and sum to 1.
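A minimal Keras sketch of such a classifier; the hidden layer is an assumption, only the 10-unit softmax output follows from the question:

import keras

model = keras.Sequential([
    keras.layers.Input(shape=(32,)),                      # 32 features per data point
    keras.layers.Dense(units=64, activation="relu"),      # hidden layer size is arbitrary
    keras.layers.Dense(units=10, activation="softmax"),   # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")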
We now want to design an autoencoder neural network, to remove noise from input data. We use the same dataset as in the previous question (10 distinct classes, 10 000 data points, with 32 features). What should the output layer of our network look like?
keras.layers.Dense(units=10, activation="relu")
keras.layers.Dense(units=32, activation=None)
keras.layers.Dense(units=10, activation="softmax")
keras.layers.Dense(units=32, activation="softmax")
keras.layers.Dense(units=10000, activation="tanh")
keras.layers.Dense(units=32, activation=None)
An autoencoder is doing regression on the inputs, which means we should have 32 output nodes with linear (=None) activation.
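A minimal sketch of such a denoising autoencoder; the bottleneck size is an assumption, only the 32-unit linear output layer follows from the question:

import keras

autoencoder = keras.Sequential([
    keras.layers.Input(shape=(32,)),
    keras.layers.Dense(units=8, activation="relu"),   # encoder / bottleneck (size assumed)
    keras.layers.Dense(units=32, activation=None),    # reconstruct all 32 input features
])
autoencoder.compile(optimizer="adam", loss="mse")     # regression loss on the inputs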
How does the Gated Recurrent Unit (GRU) architecture improve upon the simplest implementation of a recurrent network?
It unrolls the sequential for loops, increasing processing speed
It memorises all past time steps
It introduces a single new weight matrix, which is applied to the input from previous time steps
It introduces a hidden state vector, which is propagated between time steps
It introduces a hidden state vector, which is propagated between time steps
The simple RNN takes its previous output as additional input at the next time step, while the GRU takes its previous hidden state as additional input (the output at the previous time step is not propagated to the next).
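In Keras the two layers are drop-in replacements for each other, so the difference is internal to the layer; a small sketch (the input shape is chosen arbitrarily):

import keras

inputs = keras.layers.Input(shape=(20, 8))        # 20 time steps, 8 features each
simple = keras.layers.SimpleRNN(units=16)(inputs)
gated = keras.layers.GRU(units=16)(inputs)        # adds gated hidden-state propagation

print(simple.shape)  # (None, 16)
print(gated.shape)   # (None, 16)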
What is a common problem when training deep recurrent neural networks, and how can it be mitigated?
Exploding gradients. May be solved by choosing a saturating activation function like tanh.
Vanishing gradients. May be solved by regularising the network weights.
Multiple input channels are not supported. May be solved by training models in parallel.
Memory usage increases exponentially with sequence length. May be solved by distributing input sequences over multiple batches.
Exploding gradients. May be solved by choosing a saturating activation function like tanh.
In the context of large language models, what does temperature do?
It sets a limit on GPU power usage, to prevent overheating
It adjusts the probability distribution from which new tokens are sampled
It scales the attention weights in the transformer blocks
It is a measure of hallucinations in the output text
It adjusts the probability distribution from which new tokens are sampled
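A NumPy sketch of temperature-scaled sampling (the logits are made up): the logits are divided by the temperature before the softmax, so low temperatures sharpen the distribution and high temperatures flatten it:

import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))  # much more varied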
Why should we not initialise all network parameters to zero, but rather initialise them to random values?
If all parameters are set to the same value (it does not even need to be zero), all nodes perform exactly the same computation, and backpropagation will adjust all parameters in the same way – meaning all the nodes in each layer keep performing identical computations. Effectively we have only a single node per layer, which is not very useful.
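A small NumPy demonstration of the symmetry problem (values chosen arbitrarily): with identical weights, every node in the layer produces the same activation:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

W_same = np.full((3, 4), 0.5)     # all weights identical
print(np.tanh(W_same @ x))        # three identical activations: effectively one node

W_rand = rng.normal(size=(3, 4))  # random initialisation breaks the symmetry
print(np.tanh(W_rand @ x))        # three different activations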
We apply a convolutional layer to a single, tiny RGB-colour image with dimensions of 3 × 3 pixels. The convolutional layer is defined as:
Conv2D(filters=5, kernel_size=(2, 2), strides=(1, 1), padding="valid")
What will be the dimensions (shape) of the output feature map? Explain your thinking. (In case you do not find the correct values, you will still be credited for your explanation.)
We start with an input of shape (1, 3, 3, 3), being (batch, height, width, color channels). padding=’valid’ means the convolution kernel is applied only inside the boundaries of the image. strides=(1, 1) means we take steps of one pixel at a time. If we start
in the upper left corner of the image, our filter (kernel) can only take one step to the right before we meet the edge (meaning the output width is 2), and similarly only one step down
(meaning the output height is 2). filters=5 means we do this for 5 different filters, each of which merges the colour channels into a filter channel, resulting in an output shape of (1, 2, 2, 5). Forgetting the batch dimension would still give credit.
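The shape can also be checked directly in Keras by running the layer from the question on a dummy image:

import numpy as np
import keras

image = np.zeros((1, 3, 3, 3))   # (batch, height, width, colour channels)
layer = keras.layers.Conv2D(filters=5, kernel_size=(2, 2), strides=(1, 1), padding="valid")

print(layer(image).shape)        # (1, 2, 2, 5)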
In a transformer model, how can we encode positional information, such as the position of a word in a sentence?
Word positions are integers, but we want them as vectors, so we can combine them with the word embeddings. We know two ways of doing so (either one would give full credit):
- Embed them the same way as the tokens, using keras.layers.Embedding() (see the sketch below)
- Sinusoidal encoding, as used in the original attention paper
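A minimal sketch of the first approach (vocabulary size, sequence length and embedding dimension are all made up): the positions 0, 1, 2, ... are fed through their own Embedding layer and added to the token embeddings:

import numpy as np
import keras

vocab_size, seq_len, embed_dim = 1000, 20, 32

token_ids = np.arange(seq_len)[None, :]    # one example sequence of token ids
positions = np.arange(seq_len)[None, :]    # 0, 1, 2, ..., seq_len - 1

token_embed = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
pos_embed = keras.layers.Embedding(input_dim=seq_len, output_dim=embed_dim)

combined = token_embed(token_ids) + pos_embed(positions)   # add the two embeddings
print(combined.shape)  # (1, 20, 32)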
Which two metrics are used to compute similarity between vectors in an embedding space? What is the main difference between the two?
Euclidean distance and cosine similarity. The latter incorporates only the angle between the vectors, while the former incorporates vector magnitudes as well.
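Both can be computed in a couple of lines of NumPy (the vectors are arbitrary examples); note that the second vector points in the same direction as the first but has a different magnitude:

import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction, twice the magnitude

print(euclidean_distance(a, b))  # about 3.74: distance sees the magnitude difference
print(cosine_similarity(a, b))   # 1.0: the angle between them is zero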
Describe the concept of how we train a diffusion model, and how we can use it to generate new data.
The encoding (forward diffusion) step adds a small amount of random noise, while the decoding (denoising) step uses a deep learning model to predict and remove the noise. When generating
new data, we start from pure random noise, and run the decoder step sequentially to produce a noise-free data point.
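A heavily simplified NumPy sketch of the two steps; denoise_model is a placeholder for the trained network and the noise schedule is made up:

import numpy as np

rng = np.random.default_rng(0)

def forward_diffusion(x, num_steps, noise_scale=0.1):
    # Encoding: repeatedly add a small amount of Gaussian noise.
    for _ in range(num_steps):
        x = x + noise_scale * rng.normal(size=x.shape)
    return x

def generate(denoise_model, shape, num_steps):
    # Generation: start from pure noise and repeatedly remove the noise
    # predicted by the trained model (denoise_model is hypothetical here).
    x = rng.normal(size=shape)
    for step in reversed(range(num_steps)):
        x = x - denoise_model(x, step)
    return x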
A friend of yours, who has not followed DAT255, makes the following claim:
If you ask a large language model a difficult question, it takes longer to give an answer, because it has to think harder before answering.
Using your knowledge from the course, argue whether this is true or not.
Here one can reflect upon the qualities of modern state-of-the-art LLMs using chain-of-thought reasoning and/or a mixture-of-experts setup, where the difficulty of producing credible output
can affect the processing time. This is however beyond the scope of the course, so an answer based on “plain” transformer-based LLMs is perfectly fine.
From the course we know the following:
- Generally, a forward pass of a neural network performs a fixed number of computations, hence it has deterministic runtime.
- More specifically, LLMs are based on the attention mechanism, where we need to multiply matrices whose dimensions are determined by the sequence length (𝑁) and the size of the embedding space (𝐷). 𝐷 is constant for all inputs, but how about the length of the input, 𝑁? Recall from the notebooks that even though sentences usually vary in length, we had to decide on a sequence length and either pad or truncate the end of the sentence. So even for short inputs, the theoretical runtime is the same. Here you could hypothesise that a performance-oriented framework does not need to do the computation for padded values, and that would be correct. In practice, then, short inputs are faster to process. This is somewhat countered by caching previous computations (KV caching, again outside the scope for us), and other performance improvements.
- Considering the points above, the determining factor for runtime is the length of the output. We know that to produce 100 tokens, we have to run the model 100 times, and to produce 1000 tokens we have to run the model 1000 times. Hence, inputs that cause long runtimes are those that result in a long output. In our exercises we set a fixed output length, while for advanced LLMs the output length is dynamic (see the sketch below).
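A schematic sketch of the last point; model here is a placeholder for the LLM's single forward pass, predicting the next token:

def generate_tokens(model, prompt_tokens, num_new_tokens):
    # One forward pass per generated token: the runtime grows with the
    # length of the output, not with how "hard" the question is.
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        next_token = model(tokens)   # hypothetical next-token prediction
        tokens.append(next_token)
    return tokens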
In the end this question does not have a one true answer, but rather aims to test your reasoning.
Describe an issue you encountered while training the models you investigated in your project work, and explain how you solved it.
An open question without a one true answer, but an example answer could be along the lines of
- Trained a computer vision model and observed that train and validation loss diverged, which is symptomatic of overfitting.
- Tried reducing the number of convolutional layers, but this negatively impacted the test accuracy
- Implemented this and that augmentation technique, which improved variation in the training data and nearly eliminated the overfitting issue
- Added dropout regularisation before the classification layer, which further reduced overfitting and led to the final model with the best test accuracy.
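A hedged Keras sketch of the mitigations mentioned above; the augmentation choices, layer sizes and input shape are assumptions, not the actual project code:

import keras

model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 3)),
    keras.layers.RandomFlip("horizontal"),      # augmentation: more variation in training data
    keras.layers.RandomRotation(0.1),
    keras.layers.Conv2D(16, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.5),                  # regularisation before the classification layer
    keras.layers.Dense(10, activation="softmax"),
])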