Interview Flashcards
(166 cards)
"We compute a checksum for precise file deduplication, and compress the files for storage." What is meant by this?
A checksum gives you a numerical representation of the contents of a file; it is used to verify the integrity of the data in the file.
We can also remove duplicate files that have the same content but different names using this method.
In the context of the described data collection process, computing a checksum means generating a unique, fixed-size numerical value (often represented as a string of letters and numbers) that is calculated from the contents of a file or a piece of data. This checksum serves multiple purposes:
Integrity Verification: The checksum helps verify the integrity of the data or files over time. If the file is altered in any way, even a small change, recomputing the checksum will result in a different value. This allows for easy detection of corruption or unintended changes.
Deduplication: In the process described, the checksum is used for precise file deduplication. By computing and comparing checksums, it is possible to identify and eliminate duplicate files in the dataset, even if the files are named differently. This is because the checksum is based on the file content, not the file name or other metadata.
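A minimal sketch of checksum-based deduplication in Python (the SHA-256 choice and the "data" directory are assumptions for illustration, not details from the described pipeline):

import hashlib
from pathlib import Path

def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(paths):
    """Keep only the first file seen for each distinct checksum."""
    seen = {}
    for p in paths:
        digest = file_checksum(p)
        if digest not in seen:
            seen[digest] = p
    return list(seen.values())

# Example: deduplicate all Python files under a (hypothetical) data directory.
unique_files = deduplicate(Path("data").rglob("*.py"))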
What data preprocessing steps were done at Keysight?
Computed checksums (with Python code) and deduplicated; some files can also be weeded out based on the version of the code.
Excessively large code files were filtered out, as they were probably auto-generated.
Added the file path at the beginning of each file's data (see the sketch after this answer).
Tokenization was done to create chunks, and then a synthetic dataset was created by querying either Llama 2 or the GPT API, depending on data sensitivity.
Something I didn't do during the Keysight data work was dependency ordering. I read about it later in the DeepSeek-Coder paper and it makes a lot of sense: they give pseudocode for it, and their code performance is state of the art, which probably has something to do with it.
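As a purely illustrative sketch (the threshold and function name below are invented, not the actual Keysight code), the size filter and path-prepending steps could look like:

MAX_FILE_BYTES = 100_000  # assumed threshold; very large files are likely auto-generated

def preprocess(path, text):
    """Drop oversized files and prepend the file path to the content."""
    if len(text.encode("utf-8")) > MAX_FILE_BYTES:
        return None  # filtered out
    return f"# file: {path}\n{text}"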
Positional encoding: calculation and code (formula), plus the rationale behind it
import numpy as np

def positional_encoding(sentence_length, model_dim):
    """Sinusoidal positional encodings: sine on even dimensions, cosine on odd ones."""
    pos_enc_matrix = np.zeros((sentence_length, model_dim))
    for pos in range(sentence_length):
        for i in range(0, model_dim, 2):  # assumes model_dim is even
            pos_enc_matrix[pos, i] = np.sin(pos / (10000 ** (i / model_dim)))
            pos_enc_matrix[pos, i + 1] = np.cos(pos / (10000 ** (i / model_dim)))
    return pos_enc_matrix
Rationale behind the Formula
Using a sine for even indices and a cosine for odd indices gives each frequency a sine/cosine pair; the encoding at position pos + k can then be written as a linear function (a rotation) of the encoding at position pos, which makes relative offsets easy for the model to use.
This offers a simple ("hackish") way to encode positional information at multiple scales: dimensions with small i are high-frequency sinusoids that change rapidly from one position to the next, while dimensions with large i are low-frequency sinusoids that vary slowly and capture coarser, longer-range position information.
The specific constant 1/10000 spreads these frequencies over a wide geometric range and prevents the functions from saturating.
What positional encoding is used in LLaMA?
Rotary Positional Embeddings (RoPE): instead of adding a positional vector to the token embeddings, RoPE rotates the query and key vectors by position-dependent angles, so attention scores depend on relative positions.
Are mu and sigma for layer norm calculated during training or at runtime?
mu and sigma are calculated at runtime from each data sample (over its feature dimension), during both training and inference; gamma and beta are trained parameters.
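A small NumPy sketch of this (illustrative, not tied to any framework): the statistics come from the data at runtime, while gamma and beta are the learned parameters.

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # mu and sigma are computed from the data at runtime, per sample,
    # over the feature dimension; gamma and beta are learned parameters.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.randn(2, 8)                      # (batch, features)
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))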
Where is layer norm applied in the transformer architecture?
In the original (post-norm) formulation: after the linear layers of the sub-block and after the skip connection has been added, before any following activation function.
This means the workflow within each sub-block of a transformer is as follows:
- Input from the previous layer
- Addition of the skip connection output
- Layer normalization
- Activation function (if any, depending on the sub-block)
However, many implementations instead apply layer normalization before the sub-layer (pre-norm), normalizing the input and adding the un-normalized sub-layer output to the skip connection. Both orderings have been explored in practice, with different effects on training dynamics and performance: the original transformer used post-norm (normalization after the residual addition), while most recent large models use pre-norm because it tends to stabilize training of deep stacks. A rough sketch of both orderings follows below.
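In the sketch, sublayer stands in for attention or the feed-forward block; the toy shapes and random feed-forward are assumptions for illustration only.

import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Post-norm (original Transformer): sublayer output -> add skip -> normalize.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize -> sublayer -> add skip.
    return x + sublayer(layer_norm(x))

W = np.random.randn(16, 16)
ffn = lambda h: np.maximum(h @ W, 0.0)   # toy feed-forward sublayer
x = np.random.randn(4, 16)
out_post = post_norm_block(x, ffn)
out_pre = pre_norm_block(x, ffn)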
[E] What's the geometric interpretation of the dot product of two vectors?
The product of the length of one vector and the (signed) length of the projection of the other vector onto it; equivalently, u · v = ||u|| ||v|| cos θ, where θ is the angle between them.
[E] Given a vector u, find vector v of unit length such that the dot product of u and v is
maximum
Given a vector u, the vector v of unit length that maximizes the dot product u · v is the vector that points
in the same direction as u. The vector v can be found by dividing u by its own magnitude, making it a
unit vector.
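A quick NumPy check with an arbitrary example vector:

import numpy as np

u = np.array([3.0, 4.0])
v = u / np.linalg.norm(u)   # unit vector in the direction of u
print(np.dot(u, v))         # 5.0 = ||u||, the maximum achievable dot product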
Give an example of how the outer product can be useful in ML.
The covariance matrix is a commonly used quantity in machine learning algorithms (e.g. PCA). Given a dataset X ∈ R^(n×d) with n samples and d features, we calculate the (empirical) covariance as follows:
Cov[X] = (1/n) Σ_{i=1..n} (x_i − x̄)(x_i − x̄)^T
where x̄ is the mean feature vector: x̄ = (1/n) Σ_{i=1..n} x_i.
Each term (x_i − x̄)(x_i − x̄)^T is the outer product of a centered sample with itself, so the covariance matrix is built entirely from outer products.
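A small NumPy sketch of the same formula, building the covariance explicitly from outer products (random data chosen for illustration):

import numpy as np

X = np.random.randn(100, 3)              # n = 100 samples, d = 3 features
x_bar = X.mean(axis=0)
cov = sum(np.outer(x - x_bar, x - x_bar) for x in X) / len(X)

# Matches NumPy's built-in (biased, 1/n) covariance estimate.
assert np.allclose(cov, np.cov(X, rowvar=False, bias=True))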
[E] What does it mean for two vectors to be linearly independent?
Two vectors are linearly independent if no scalar multiple of one vector equals the other. This means
they do not lie on the same line through the origin and neither can be formed by scaling the other.
[M] Given two sets of vectors A and B, how do you check that they span the same space (share the same basis)?
First check that A and B have the same rank (the same dimension of span). This can be done by converting A and B into row echelon form, or with the Singular Value Decomposition (SVD): decompose A into the product of three matrices, A = U Σ V^T, where U and V are orthogonal matrices and Σ is diagonal; the number of non-zero diagonal entries in Σ is the rank of A.
Once the ranks match, take the r vectors that span the row space of each set and show that every row of A is spanned by the rows of B and vice versa; equivalently, stack A and B together and check that the rank does not increase (see the sketch below).
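One way to carry out this check numerically with NumPy (A and B hold the vectors as rows; the example B is constructed to share A's row space):

import numpy as np

A = np.random.randn(3, 5)
B = np.vstack([A[0] + A[1], A[1] - A[2], 2 * A[2]])   # same row space as A by construction

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_stacked = np.linalg.matrix_rank(np.vstack([A, B]))

# A and B span the same subspace iff stacking them adds no new directions.
same_span = rank_A == rank_B == rank_stacked
print(same_span)  # True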
[M] Given n vectors, each of d dimensions. What is the dimension of their span?
The dimension of their span equals the number of linearly independent vectors among them. If all n vectors are linearly independent, the dimension of the span is n (which requires n ≤ d); otherwise it is smaller.
Equivalently, think of the vectors as the columns of a d×n matrix: the dimension of the span is the rank of that matrix, which is at most min(n, d).
[E] What's a norm? What are the L0, L1, L2, and L∞ norms?
A norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is given a length of zero.
The L0 "norm" is the number of nonzero elements in a vector (not a true norm, since it is not homogeneous),
the L1 norm is the sum of the absolute values of the elements,
the L2 norm (also known as the Euclidean norm) is the square root of the sum of the squares of the elements,
and the L∞ norm is the maximum absolute value of the elements.
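These can all be computed directly with NumPy:

import numpy as np

v = np.array([3.0, 0.0, -4.0])
l0 = np.count_nonzero(v)          # 2   (number of nonzero entries)
l1 = np.sum(np.abs(v))            # 7.0 (sum of absolute values)
l2 = np.linalg.norm(v)            # 5.0 (Euclidean length)
linf = np.max(np.abs(v))          # 4.0 (largest absolute entry)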
[M] How do norm and metric differ? Given a norm, make a metric. Given a metric, can we make a norm?
Ans. A metric measures distances between pairs of things while a norm measures the size of a single item. Metrics can be defined on pretty much anything, while the notion of a norm applies only to vector spaces: the very definition of a norm requires that the things measured by the norm could be added and scaled. If you have a norm, you can define a metric by saying that the distance between a and b is the size of a - b.
On the other hand, if you have a metric you can't usually define a norm: a metric induced by a norm must be translation-invariant and homogeneous (scaling a vector scales its distance from the origin), and a general metric need not satisfy either property.
[E] Why do we say that matrices are linear transformations?
Matrices represent linear transformations because they map linear combinations of vectors to other linear
combinations.
Specifically, for a matrix M and vectors u and v, and a scalar c, the following properties hold true, which are
the properties of linear transformations:
M(u + v) = Mu + Mv
Applying M to the sum of two vectors is the same as summing the results of applying M to each
vector individually.
M(cu) = c(Mu)
Applying M to a scalar multiple of a vector is the same as multiplying the result of applying M by
that same scalar.
These properties demonstrate that matrix multiplication exhibits the key aspects of a linear transformation:
1. Additivity - Applying the transformation to vector sums gives the sum of individually transformed
vectors
2. Homogeneity - Scalars distribute across the transformation
Thus, matrices and matrix multiplication inherently represent and operate as linear transformations
between vector spaces.
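A quick numerical check of both properties with random data:

import numpy as np

M = np.random.randn(3, 3)
u, v, c = np.random.randn(3), np.random.randn(3), 2.5

assert np.allclose(M @ (u + v), M @ u + M @ v)   # additivity
assert np.allclose(M @ (c * u), c * (M @ u))     # homogeneity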
What’s the inverse of a matrix? Do all matrices have an inverse? Is the inverse of a matrix
always unique?
The inverse of a matrix A is another matrix A^-1 such that AA^-1 = A^-1A = I, where I is the identity matrix.
Not all matrices have an inverse; only square matrices that are non-singular (with a non-zero determinant)
have an inverse. When a matrix has an inverse, it is always unique.
[E] What does the determinant of a matrix represent?
The determinant of a matrix can be interpreted as a scaling factor for the transformation that the matrix
represents. Geometrically, it represents the volume scaling factor of the linear transformation described by
the matrix, including whether the transformation preserves or reverses the orientation of the space.
[E] What happens to the determinant of a matrix if we multiply one of its rows by a scalar t ∈ R?
Multiplying a row of a matrix by a scalar t scales the determinant of the matrix by that scalar. So if the
original determinant was d, the new determinant will be t×d.
[M] A 4×4 matrix has four eigenvalues 3,3,2,−1. What can we say about the trace and the
determinant of this matrix?
The trace of the matrix, which is the sum of its eigenvalues, would be 3+3+2−1=7. The determinant of the
matrix, which is the product of its eigenvalues, would be 3×3×2×(−1) = −18.
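A quick NumPy check, using a matrix constructed (via an arbitrary change of basis) to have exactly these eigenvalues:

import numpy as np

P = np.random.randn(4, 4)                          # generically invertible
A = P @ np.diag([3, 3, 2, -1]) @ np.linalg.inv(P)  # eigenvalues 3, 3, 2, -1

print(np.trace(A))        # ~7.0   (sum of eigenvalues)
print(np.linalg.det(A))   # ~-18.0 (product of eigenvalues)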
What is the determinant of a matrix with linearly dependent rows or columns?
0
[M] What’s the difference between the covariance matrix A^TA and the Gram matrix AA^T?
Suppose A ∈ R^(n×d), corresponding to n samples with d features each. Then the covariance matrix A^T A ∈ R^(d×d) captures the "similarity" between features, whereas the Gram matrix A A^T ∈ R^(n×n) captures the "similarity" between samples.
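A shape check in NumPy makes the distinction concrete (sizes chosen arbitrarily):

import numpy as np

A = np.random.randn(100, 5)     # n = 100 samples, d = 5 features
print((A.T @ A).shape)          # (5, 5): feature-by-feature "similarity"
print((A @ A.T).shape)          # (100, 100): sample-by-sample "similarity"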
Given A ∈ R^(n×m) and b ∈ R^n, find x such that Ax = b.
When does this have a unique solution?
Why is it that when A has more columns than rows, Ax = b can have multiple solutions?
Given a matrix A with no inverse, how would you solve the equation Ax = b?
Unique Solution
Square and Full Rank: If A is square (n = m) and has full rank (rank(A) = m = n), then A is invertible. The equation Ax = b has a unique solution given by x = A^(-1)b.
Non-Square but Full Column Rank: If A is not square (m ≠ n), but rank(A) = m (full column rank) and m < n (tall matrix), a left inverse A_L of A exists such that A_L A = I. This configuration implies Ax = b has a unique solution when b is in the range (column space) of A. The solution can be given by x = A_L b. In this scenario, A_L = (A^T A)^(-1)A^T and there is no null space of A (nullity(A) = 0) because all columns are linearly independent.
Multiple or No Solutions
Underdetermined System (m > n): If A is a wide matrix (more columns than rows), the rank of A can be at most n, and the null space of A has dimension m − n (assuming rank(A) = n). This situation typically means:
No Solution: If b is not in the column space of A, then Ax = b has no solution.
Infinitely Many Solutions: If b is in the column space of A, there are infinitely many solutions because there are free variables associated with the non-trivial null space of A.
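In practice these cases can be handled with NumPy: np.linalg.solve for the square full-rank case and np.linalg.lstsq (or the pseudoinverse) otherwise; the shapes below are arbitrary examples.

import numpy as np

# Square, full rank: unique solution.
A = np.random.randn(4, 4)
b = np.random.randn(4)
x = np.linalg.solve(A, b)

# Wide matrix (more columns than rows): infinitely many solutions when b is
# in the column space; lstsq returns the minimum-norm one.
A_wide = np.random.randn(3, 5)
x_min_norm, *_ = np.linalg.lstsq(A_wide, np.random.randn(3), rcond=None)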
What is the pseudoinverse and how to calculate it?
Method 1: Singular Value Decomposition (SVD)
The most reliable method for computing the pseudoinverse of any matrix, whether square or rectangular, full rank or rank-deficient, is using its Singular Value Decomposition (SVD). The SVD of matrix A is given by:
A = U Σ V^T
where:
U is an n×n orthogonal matrix whose columns are the left singular vectors of A.
V is an m×m orthogonal matrix whose columns are the right singular vectors of A.
Σ is an n×m diagonal matrix with non-negative real numbers on the diagonal, known as the singular values.
The pseudoinverse A+ is then calculated as:
A+ = V Σ+ U^T
Here, Σ+ is obtained by taking the reciprocal of each non-zero singular value in Σ, and transposing the matrix.
Method 2: Using Matrix Transpose and Inversion
For matrices that are either full row rank or full column rank, the pseudoinverse can also be computed directly using matrix transposes and inversion:
Full Column Rank (m≤n and rank(A)=m):
A+ = (A^T A)^(-1) A^T
Full Row Rank (m≥n and rank(A)=n):
A+ = A^T (A A^T)^(-1)
Note: These formulas assume that A has full column or row rank, respectively.
Shortcut: form A A^T or A^T A, whichever is full rank (and therefore invertible), and use the corresponding formula above to compute the pseudoinverse.
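A sketch of Method 1 in NumPy, checked against the built-in np.linalg.pinv (random full-rank matrix for illustration):

import numpy as np

A = np.random.randn(5, 3)

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # A = U @ Sigma @ V^T
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))   # m x n: transposed shape
Sigma_plus[:len(s), :len(s)] = np.diag(1.0 / s)   # reciprocal of the (here nonzero) singular values
A_plus = Vt.T @ Sigma_plus @ U.T

assert np.allclose(A_plus, np.linalg.pinv(A))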
What does the derivative represent?
Answer: The derivative of a function measures the sensitivity to change in the function output with
respect to a change in the input.
Moreover, when it exists, the derivative at a given point is the slope of the tangent line to the graph
of the function at that point. The tangent line is the best linear approximation of the function at
that input value. This is the reason why in gradient descent we (slowly) move in the (negative)
direction of the derivative.
What’s the difference between derivative, gradient, and Jacobian?
When f: R → R, we calculate the derivative df/dx.
When f: R^n → R, we calculate the gradient:
∇f = [∂f/∂x1, ∂f/∂x2,…, ∂f/∂xn]
When f: R^n → R^m, we calculate the Jacobian (an m×n matrix):
Jac(f) = [
∂f1/∂x1 … ∂f1/∂xn
…
∂fm/∂x1 … ∂fm/∂xn
]
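A small finite-difference illustration of the gradient and Jacobian in NumPy (purely a sketch; in practice an autodiff library would be used):

import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Finite-difference gradient of f: R^n -> R at point x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m at point x (m x n matrix)."""
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([1.0, 2.0])
print(numerical_gradient(lambda v: v[0] ** 2 + 3 * v[1], x))           # [2., 3.]
print(numerical_jacobian(lambda v: np.array([v[0] * v[1], v[1]]), x))  # [[2., 1.], [0., 1.]]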