ML-Final Flashcards Preview

Flashcards in ML-Final Deck (52):

What is a threshold logic unit?

Simple model of a neuron

Each input value is multiplied with the corresponding weight value, and these weighted values are then summed.

If the weighted summed input is larger than a certain threshold value, the output is set to one; otherwise it is set to zero.
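
A minimal Python sketch of the unit described on this card. The function name and the example weights and threshold are hypothetical, chosen only so that the unit behaves like a logical AND.

```python
def threshold_logic_unit(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


# Hypothetical example values: with these settings the unit acts as a logical AND.
and_weights = [1.0, 1.0]
and_threshold = 1.5
outputs = [threshold_logic_unit([a, b], and_weights, and_threshold)
           for a in (0, 1) for b in (0, 1)]
print(outputs)
```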


What is a weight parameter?

A value representing the ‘strength’ of a connection.


What is a perceptron?

A node whose output is calculated by applying an activation function (also called a gain function, transfer function, or output function) to the weighted summed input.


Give examples of activation functions (also called gain, transfer, or output functions).

Sigmoid (logistic function), tanh


Why do we add a bias term to a perceptron?

A bias allows a perceptron to shift its decision boundary away from the origin, so the prediction can fit the data better.
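
The perceptron cards above can be sketched in a few lines of Python. This is only an illustration: the names `sigmoid` and `perceptron_output` and all numeric values are assumptions, not taken from the deck.

```python
import math


def sigmoid(a):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def perceptron_output(inputs, weights, bias, activation):
    """Apply an activation function to the weighted summed input plus a bias."""
    a = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(a)


x = [0.5, -1.0]   # hypothetical input
w = [2.0, 1.0]    # hypothetical weights; the weighted sum of x is exactly 0

y_sigmoid = perceptron_output(x, w, 0.0, sigmoid)     # sigmoid activation
y_tanh = perceptron_output(x, w, 0.0, math.tanh)      # tanh activation
y_shifted = perceptron_output(x, w, 1.0, sigmoid)     # same input, non-zero bias
print(y_sigmoid, y_tanh, y_shifted)
```

With a zero weighted sum the sigmoid node sits exactly at its midpoint; adding a positive bias moves the output above that midpoint without changing the weights.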


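Similarities of SVM and perceptron?

Both compute a linear decision boundary from a weighted sum of the inputs. A linear SVM can be viewed as a perceptron trained with a maximum-margin criterion.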


What is the difference between Deep Learning and SVM?

An SVM solves its optimization problem using specific, pre-chosen transformations of the feature space.

Deep learning aims to learn the appropriate transformations from the data.


What is the delta term (delta rule)?

δ = (y(i) − y) y (1 − y), where y(i) is the target for training example i and y is the output of a sigmoid node
The delta rule is a gradient descent learning rule for updating the weights of the inputs.
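
A small sketch of one delta-rule step for a single sigmoid node. The card gives only the delta term; the weight update `w + learning_rate * delta * x` is the usual gradient descent form for a squared error and is an assumption here, as are all names and numbers.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def delta_rule_step(weights, inputs, target, learning_rate):
    """One gradient descent step for a single sigmoid node with squared error."""
    y = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    delta = (target - y) * y * (1.0 - y)   # the delta term from the card
    # Assumed update form: move each weight in proportion to delta and its input.
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    return new_weights, y


weights = [0.0, 0.0]
inputs = [1.0, 1.0]    # hypothetical training example with target 1
errors = []
for _ in range(200):
    weights, y = delta_rule_step(weights, inputs, target=1.0, learning_rate=0.5)
    errors.append((1.0 - y) ** 2)
print(errors[0], errors[-1])
```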


Why is a multilayer feedforward network a universal function approximator?

For any (continuous) function f, there is guaranteed to be a neural network that outputs a value arbitrarily close to f(x) for every possible input x.

Given enough hidden nodes, any such function can be approximated with arbitrary precision by these networks.


What is error backpropagation (backpropagation)?

The calculation of the gradient proceeds backwards through the network: the gradient of the final layer of weights is calculated first, and the gradient of the first layer of weights is calculated last.
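
To make the backward order concrete, here is a sketch for a tiny network with one input, one sigmoid hidden node and one sigmoid output node, assuming a squared error E = 0.5 (y − target)². The network shape, the loss and all names are assumptions made for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def backprop_two_layer(x, target, w1, w2):
    """Return (dE/dw1, dE/dw2) for E = 0.5 * (y - target)**2."""
    # Forward pass: input -> hidden node -> output node.
    h = sigmoid(w1 * x)
    y = sigmoid(w2 * h)
    # Backward pass, final layer first.
    delta_out = (y - target) * y * (1.0 - y)
    grad_w2 = delta_out * h
    # The hidden-layer delta reuses the output delta, passed back through w2.
    delta_hidden = delta_out * w2 * h * (1.0 - h)
    grad_w1 = delta_hidden * x
    return grad_w1, grad_w2


x, target, w1, w2 = 0.5, 1.0, 0.8, -0.4   # hypothetical values
grad_w1, grad_w2 = backprop_two_layer(x, target, w1, w2)
print(grad_w1, grad_w2)
```

Note that the gradient here uses (y − target), so a gradient descent step subtracts it from the weights.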


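Batch, mini-batch and online learning

Online (update after every example): the noisy updates help to avoid local minima.
Mini-batch (update after a small subset of examples): suited to large datasets.
Batch (update after the whole training set): needs a lot of memory.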
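
The three modes differ only in how many examples are used per weight update. A sketch, with a hypothetical helper `iterate_batches` and a made-up dataset of ten examples:

```python
def iterate_batches(data, batch_size):
    """Yield consecutive slices of the data; one weight update would follow each slice."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = list(range(10))   # hypothetical training set of 10 examples

online = list(iterate_batches(data, 1))           # one example per update
mini_batch = list(iterate_batches(data, 4))       # small subsets per update
batch = list(iterate_batches(data, len(data)))    # whole set per update
print(len(online), len(mini_batch), len(batch))
```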


"No free lunch" theorem

There is no one model that works best for every problem.

The assumptions of a great model for one problem may not hold for another problem.


What is cross entropy (the negative log probability)? What is it used for?

The expected negative log probability that the current model (a probability distribution) assigns to the given labels.

H(p, q) = − sum_y [ p(y) log q(y) ]

p: the true nature of the data (the given labels)
q: the probability distribution represented by the neural network model, q(y|x; w)

It is used as the loss function from which the learning rule is derived.


KL-divergence: what is it equivalent to? What is it related to? What is it used for in neural networks?

It is related to the cross entropy:
H(p, q) = H(p) + KL(p||q)

Since H(p) does not depend on the model, minimizing the cross entropy is equivalent to minimizing the KL-divergence.

Both are closely related to the maximum (log) likelihood principle.

They are used as loss functions from which the learning rule is derived.
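
The identity H(p, q) = H(p) + KL(p||q) can be checked numerically. The sketch below uses natural logarithms and two made-up discrete distributions; the function names are illustrative only.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]   # hypothetical distribution the expectation is taken over
q = [0.5, 0.3, 0.2]   # hypothetical second distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)
```

Because the entropy term depends only on the first distribution, changing the second one moves the cross entropy and the KL-divergence by the same amount.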


What is the softmax function, and why and where is it used in a neural network?

The softmax function is a generalization of the logistic function that "squeezes" each output into the range (0, 1), with all outputs summing to one.

It is used to highlight the largest values and suppress values which are significantly below the maximum value.

It is typically used in the final layer of a neural network.
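
A short sketch of the softmax function. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result; the input values are hypothetical.

```python
import math


def softmax(values):
    """Map a list of real numbers to positive values that sum to one."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]   # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])      # hypothetical final-layer activations
two_class = softmax([0.0, 1.5])       # with two classes this reduces to the logistic function
print(probs, two_class)
```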


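How are neural networks related to probabilistic regression?

Through the cross entropy and the KL-divergence: the network is treated as a probabilistic model of the labels given the inputs and is trained by minimizing these quantities.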


What is the relationship between maximizing the log probability and the cross entropy?

We want to maximize the log probability that the model assigns to the given labels. Since the cross entropy is the negative of this, maximizing the log probability is equivalent to minimizing the cross entropy.
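
For a single example with a one-hot label, the relationship can be seen directly. The label and prediction below are made up for illustration.

```python
import math


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


label = [0.0, 1.0, 0.0]         # one-hot label: the true class is index 1
prediction = [0.2, 0.7, 0.1]    # hypothetical model output

log_prob_of_label = math.log(prediction[1])
loss = cross_entropy(label, prediction)
print(loss, -log_prob_of_label)
```

Only the term for the true class survives, so the cross entropy is exactly the negative log probability of that class.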


What is deep learning?

Deep learning basically refers to neural networks with many layers.


What is a filter in CNN?

It is a small array of weights (a vector in one dimension) describing a pattern to look for in the input.


What is convolution?

Convolution is the operation of multiplying the filter with the input element by element and adding up the results, repeated while shifting the filter across the input.


What is a stride in CNN?

It is how many positions the filter is shifted in each iteration of the convolution.
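
The filter, convolution and stride cards can be combined in a one-dimensional sketch. As is conventional in CNNs, the filter is not flipped (strictly speaking this is cross-correlation). The signal and filter values are hypothetical.

```python
def convolve_1d(signal, kernel, stride=1):
    """Slide the filter over the signal, multiplying and summing at each position."""
    outputs = []
    for start in range(0, len(signal) - len(kernel) + 1, stride):
        window = signal[start:start + len(kernel)]
        outputs.append(sum(s * k for s, k in zip(window, kernel)))
    return outputs


signal = [0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 1]    # hypothetical filter that responds to a rising edge

stride_1 = convolve_1d(signal, edge_filter, stride=1)
stride_2 = convolve_1d(signal, edge_filter, stride=2)
print(stride_1)
print(stride_2)
```

A larger stride visits fewer positions, so the output is shorter and can miss patterns that fall between the visited positions.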


What is a pooling operation in convolutional neural networks and why is this operation important?

Pooling takes the average or the maximum of the previous output over a certain area of the filtered image.

It compresses the image down into a higher-level representation.

This is usually called downsampling. The operation is important because it reduces the dimensionality of the features and the computational cost.

It also helps to prevent overfitting.
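
A one-dimensional sketch of max and average pooling over non-overlapping windows. The helper name and the feature-map values are assumptions for illustration.

```python
def pool_1d(values, size, mode="max"):
    """Summarise each non-overlapping window by its maximum or its average."""
    pooled = []
    for start in range(0, len(values) - size + 1, size):
        window = values[start:start + size]
        pooled.append(max(window) if mode == "max" else sum(window) / size)
    return pooled


feature_map = [1, 3, 2, 8, 4, 4]    # hypothetical filter output

max_pooled = pool_1d(feature_map, 2, "max")
avg_pooled = pool_1d(feature_map, 2, "average")
print(max_pooled, avg_pooled)
```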


Briefly explain ‘dropout’ and why it is used in deep networks.

Dropout: randomly (e.g. with p = 0.5) ignoring hidden nodes for a specific input during learning, i.e. temporarily turning them off.

It is used because it is a regularization technique that helps to prevent overfitting.
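
A minimal sketch of dropout applied to a list of hidden activations during training. Many implementations also rescale the activations that are kept; that step is left out here to stay close to the card. All names and values are hypothetical.

```python
import random


def dropout(activations, p_drop, rng):
    """Zero each hidden activation with probability p_drop (training time only)."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


rng = random.Random(0)    # fixed seed so the sketch is repeatable
hidden = [0.3, 1.2, 0.8, 0.5, 2.0, 0.1]

dropped = dropout(hidden, p_drop=0.5, rng=rng)
print(dropped)
```

A fresh random mask is drawn for every input, so a different subset of nodes is turned off each time.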


Sparse representation, compressed representation and fully distributed representation.



What is an autoencoder? Why is it useful?

An autoencoder is a neural network that tries to reconstruct its input.

It is a feature extraction algorithm: it helps us find a representation for our data.


What is the relation between Ridge regression and a Gaussian prior?

Ridge regression uses L2 regularization, and L2 regularization is equivalent to placing a zero-mean Gaussian prior on the weights.
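
A short derivation sketch of this equivalence, assuming a linear model with Gaussian noise of variance σ² and an independent zero-mean Gaussian prior of variance τ² on each weight. The symbols σ, τ and λ are assumptions introduced here, not notation from the deck.

```latex
\begin{align}
p(w \mid D) &\propto \prod_i \mathcal{N}\!\left(y^{(i)} \mid w^\top x^{(i)}, \sigma^2\right)
               \prod_j \mathcal{N}\!\left(w_j \mid 0, \tau^2\right) \\
-\log p(w \mid D) &= \frac{1}{2\sigma^2} \sum_i \left(y^{(i)} - w^\top x^{(i)}\right)^2
               + \frac{1}{2\tau^2} \sum_j w_j^2 + \text{const} \\
\hat{w}_{\text{MAP}} &= \arg\min_w \sum_i \left(y^{(i)} - w^\top x^{(i)}\right)^2
               + \lambda \lVert w \rVert_2^2,
               \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{align}
```

Maximizing the posterior under the Gaussian prior therefore gives the ridge regression objective, with a narrower prior (smaller τ²) corresponding to stronger regularization.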


What is batch normalization?

Batch normalization: normalizing the input to each hidden layer over each mini-batch.
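
A sketch of the normalization step for the inputs to a single hidden node over one mini-batch. Batch normalization layers usually also apply a learned scale and shift afterwards; those are omitted here. The small constant `eps` and the example values are assumptions.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Normalize one feature over a mini-batch to zero mean and roughly unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(variance + eps) for x in batch]


mini_batch = [2.0, 4.0, 6.0, 8.0]    # hypothetical inputs to one hidden node
normalized = batch_normalize(mini_batch)
print(normalized)
```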


What are skip connections in a neural network?

Connections that skip over layers of the network (e.g. convolutional layers), passing a layer's input directly to a later layer.


What is a recurrent neural network (RNN), and where does the term ‘recurrent’ come from? What is it used for?

Recurrent neural networks perform the same task for every element of a sequence, with the output of each step fed back as an input to the next step (hence ‘recurrent’).

They are used for sequence processing, e.g. for machine translation.
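
A sketch of a recurrent network with a single hidden unit, showing the same update applied to every element of a sequence. The tanh activation, the scalar weights and the input sequence are assumptions made for illustration.

```python
import math


def rnn_forward(sequence, w_in, w_rec, h0=0.0):
    """Apply the same update, with the same weights, to every element of the sequence."""
    h = h0
    states = []
    for x in sequence:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5)
print(states)
```

The state stays non-zero after the input has returned to zero, which is how the network carries information along the sequence.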


Explain what backpropagation-through-time in an RNN is.



What is a gated neural network?

A gated recurrent network has an extra memory state that is carried from the current step to the next step.

A forget gate and a write gate can modify its value.

Examples of such networks are the LSTM (long short-term memory) and the gated recurrent unit (GRU).
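
A heavily simplified sketch of how two gates can modify a memory value. In an actual LSTM or GRU the gate values are themselves computed by the network from the current input and state; here they are passed in as plain numbers between 0 and 1, which is an assumption made to keep the example short.

```python
def gated_memory_step(memory, candidate, forget_gate, write_gate):
    """Keep a fraction of the old memory and write a fraction of the new candidate."""
    return forget_gate * memory + write_gate * candidate


kept = gated_memory_step(2.0, 5.0, forget_gate=1.0, write_gate=0.0)      # memory untouched
replaced = gated_memory_step(2.0, 5.0, forget_gate=0.0, write_gate=1.0)  # memory overwritten
blended = gated_memory_step(2.0, 5.0, forget_gate=0.5, write_gate=0.5)   # partial update
print(kept, replaced, blended)
```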


What is a Boltzmann machine? What is its challenge?

A special form of recurrent network in which the connections between nodes are symmetric.

The challenge is finding practical training rules.


What is reinforcement learning (RL)?

A learning system in which an agent takes actions and learns from the rewards it receives.


In reinforcement learning, what is a policy?

A policy in reinforcement learning is used to determine the action to take in each state.


What are the RL challenges?

1. Credit assignment

2. Exploration versus exploitation trade-off


What is the Markov condition (as in a Markov decision process)? How does it relate to the transition function in RL?

The transition function depends only on the previous state and the intended action from that state, not on any earlier history.


What is the reward function in RL?

r_{t+1} = ρ(s_t, a_t)
Returns the value of the reward when the agent enters state s_{t+1} by taking action a_t from state s_t.


What is a policy in RL?

A policy in reinforcement learning is used to determine the action to take in each state. Policy: a_t = π(s_t)


Value function and optimal value function

Based on the reward and the discounted reward.
The state-action value function tells us how good action a is in state s.

Value function (state-action): Q^π(s, a)
Value function (state): V^π(s) = Q^π(s, π(s))
Optimal value function: V∗(s) = max_a Q∗(s, a)


Optimal policy

Optimal policy: π∗(s) = arg max_a Q∗(s, a)
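
Given a table of optimal state-action values, the optimal value and the optimal policy follow from a max and an arg max over actions. The table below is entirely made up for illustration, as are the function names.

```python
# Hypothetical optimal state-action values Q*(s, a) for two states and two actions.
q_star = {
    "s0": {"left": 0.2, "right": 1.0},
    "s1": {"left": 0.5, "right": 0.1},
}


def optimal_value(q, state):
    """V*(s): the largest state-action value available in this state."""
    return max(q[state].values())


def optimal_policy(q, state):
    """pi*(s): the action with the largest state-action value in this state."""
    return max(q[state], key=q[state].get)


for s in q_star:
    print(s, optimal_policy(q_star, s), optimal_value(q_star, s))
```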


What is model-based reinforcement learning?

We assume that the agent has a model of the environment and its behaviour, in the sense that it knows the reward function ρ(s, a) and the transition function τ(s, a).


What is model-free reinforcement learning?



Explain the difference between the SARSA and Q-learning algorithms.

SARSA (State-Action-Reward-State-Action) is an on-policy approach to RL.

Its update contains the term γ Q(s_{t+1}, a_{t+1}), where a_{t+1} is the action that the current policy actually selects in the next state, so the learned values depend on the policy being followed.

Q-learning is an off-policy approach to RL.

Its update instead contains the term γ max_a′ Q(s_{t+1}, a′), which is the part that differs from SARSA.
Here we do not limit how the next action is selected, which means the values learned by Q-learning do not depend on the policy being followed.
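
The two terms on this card can be placed side by side in code. The card gives only the discounted terms; the full update form with reward r and learning rate alpha is the commonly used one and is an assumption here, as are the table values.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the action actually chosen in the next state."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the best action in the next state, whatever is chosen."""
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def make_q_table():
    """Hypothetical Q-table with two states and two actions."""
    return {"s0": {"a0": 0.0, "a1": 0.0}, "s1": {"a0": 1.0, "a1": 3.0}}


# Suppose exploration picks the worse action "a0" in the next state "s1".
q_sarsa = make_q_table()
sarsa_update(q_sarsa, "s0", "a0", r=0.0, s_next="s1", a_next="a0", alpha=0.5, gamma=0.9)

q_learn = make_q_table()
q_learning_update(q_learn, "s0", "a0", r=0.0, s_next="s1", alpha=0.5, gamma=0.9)

print(q_sarsa["s0"]["a0"], q_learn["s0"]["a0"])
```

The SARSA value is pulled down by the exploratory choice, while the Q-learning value ignores it and uses the best available action.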


Epsilon-greedy policy?



What is the difference between on-policy and off-policy?



Basic Bellman equation?



What can we learn about SARSA and Q-learning?

SARSA will avoid the mistakes due to exploration, while Q-learning is still able to learn when a different exploration policy is used.


What is the reward function in RL? What is the transition function?

The reward function ρ(s, a) gives the reward for taking action a in state s, and the transition function τ(s, a) gives the state that results from it.


What is a non-Markovian condition?

A non-Markovian condition is the case in which the next state depends on a series of previous states and actions, not only on the most recent ones.


Temporal difference?