Write the NumPy code for K-Means Clustering
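One possible answer, a minimal sketch of Lloyd's algorithm (the function name, seeding, and empty-cluster fallback are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by sampling k distinct rows of X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # pairwise distances via broadcasting: shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # assign each sample to its nearest centroid
        labels = np.argmin(dists, axis=1)
        # recompute each centroid as the mean of its assigned samples;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```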
Write the NumPy code for PCA
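One possible answer, a sketch via eigendecomposition of the covariance matrix (SVD of the centered data is an equally valid route; names are my own):

```python
import numpy as np

def pca(X, n_components):
    # center the data so the covariance is computed about the mean
    X_centered = X - X.mean(axis=0)
    # feature covariance matrix: shape (d, d)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    # sort descending by explained variance, keep the top components
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # project the centered data onto the principal components
    return X_centered @ components, components, eigvals[order]
```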
What is Gini impurity, its formula, and NumPy implementation? Finally, what is the weighted Gini of a split in a decision tree?
Gini impurity measures the probability of misclassifying a randomly chosen sample if labeled according to the node’s class distribution. A pure node scores 0; a maximally mixed binary node scores 0.5.
Formula: G = 1 - Σ pᵢ²
where pᵢ is the proportion of samples belonging to class i.
import numpy as np

def gini(y):
    # compute proportion of each class
    p = np.bincount(y) / len(y)
    # apply formula: 1 - sum(pi^2)
    return 1 - np.sum(p ** 2)

Sanity checks:
gini([0,0,0,0]) → 0.0 (pure node)
gini([0,0,1,1]) → 0.5 (maximally mixed)
The weighted Gini of a split is just:
G_split = (n_left/n) * G_left + (n_right/n) * G_right
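A quick worked example of the weighted-Gini formula (the split values here are made up for illustration):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

left_y = np.array([0, 0, 1])      # G_left  = 1 - (2/3)^2 - (1/3)^2 = 4/9
right_y = np.array([1, 1, 1, 0])  # G_right = 1 - (3/4)^2 - (1/4)^2 = 3/8
n = len(left_y) + len(right_y)
g_split = (len(left_y) / n) * gini(left_y) + (len(right_y) / n) * gini(right_y)
# g_split = (3/7)*(4/9) + (4/7)*(3/8) ≈ 0.4048
```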
You have a NumPy array [1.4, 10.5, 3.5, 2.1]. How can you calculate the thresholds between consecutive values in the array?
Use numpy.unique(values) to get the sorted unique values, then slice the result so each value is paired with its successor (arr[:-1] lines up with arr[1:]). Adding the two slices elementwise and dividing by 2 gives the midpoints. Here’s the code:
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
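Worked through on the array from the question:

```python
import numpy as np

values = np.array([1.4, 10.5, 3.5, 2.1])
# np.unique sorts ascending: [1.4, 2.1, 3.5, 10.5]
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
# midpoints: 1.75, 2.8, 7.0
```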
Write the NumPy code for Decision Trees.
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None,
                 right=None, value=None):
        self.feature = feature      # index of feature to split on
        self.threshold = threshold  # threshold value to split on
        self.left = left            # left child Node
        self.right = right          # right child Node
        self.value = value          # if leaf, stores majority class

    def is_leaf(self):
        # a node is a leaf if it has a value (i.e. no children)
        return self.value is not None

class DecisionTree:
    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def gini(self, y):
        # np.bincount counts occurrences of each class (index=class, value=count)
        # dividing by len(y) converts counts to proportions p_i
        p = np.bincount(y) / len(y)
        # gini = 1 - sum(p_i^2), measures probability of misclassification
        return 1 - np.sum(p ** 2)

    def best_split(self, X, y):
        best_gini = float('inf')
        best_feature, best_threshold = None, None
        # iterate over every feature (column of X)
        for feature in range(X.shape[1]):
            # extract 1D array of all sample values for this feature
            values = X[:, feature]
            # np.unique returns sorted unique values
            # consecutive midpoints avoid empty splits and generalize better
            # arr[:-1] = all but last, arr[1:] = all but first -> consecutive pairs
            # averaging elementwise gives midpoints between consecutive unique values
            thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
            for threshold in thresholds:
                # boolean mask selects samples where feature <= threshold (left)
                # and > threshold (right)
                left_y = y[values <= threshold]
                right_y = y[values > threshold]
                len_y = len(y)
                len_left_y = len(left_y)
                len_right_y = len(right_y)
                # skip if either split is empty
                if not len_left_y or not len_right_y:
                    continue
                # weighted gini: weight each child's gini by its proportion of samples
                # G_split = (n_left/n)*G_left + (n_right/n)*G_right
                gini_split = (len_left_y/len_y)*self.gini(left_y) + \
                             (len_right_y/len_y)*self.gini(right_y)
                # greedily track the split with the lowest weighted gini
                if gini_split < best_gini:
                    best_gini = gini_split
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def build_tree(self, X, y, depth=0):
        # stopping condition 1: max depth reached
        # stopping condition 2: np.unique(y) returns unique classes;
        #   if only 1 exists, node is pure
        # stopping condition 3: too few samples left to split
        if (depth == self.max_depth or len(np.unique(y)) == 1
                or len(y) < self.min_samples_split):
            # np.bincount(y) counts each class, np.argmax returns the majority class index
            return Node(value=np.argmax(np.bincount(y)))
        best_feature, best_threshold = self.best_split(X, y)
        # stopping condition 4: no valid split found (e.g. all features constant)
        if best_feature is None:
            return Node(value=np.argmax(np.bincount(y)))
        # X[:, best_feature] extracts the best feature column as a 1D array
        values = X[:, best_feature]
        # boolean masks split X and y into left/right subsets simultaneously
        left_X, left_y = X[values <= best_threshold], y[values <= best_threshold]
        right_X, right_y = X[values > best_threshold], y[values > best_threshold]
        # recurse on each subset, incrementing depth
        left_child = self.build_tree(left_X, left_y, depth + 1)
        right_child = self.build_tree(right_X, right_y, depth + 1)
        # return internal node storing split info and children
        return Node(feature=best_feature, threshold=best_threshold,
                    left=left_child, right=right_child)

    def fit(self, X, y):
        self.root = self.build_tree(X, y)

    def _traverse(self, node, x):
        # base case: leaf node, return its majority class value
        if node.is_leaf():
            return node.value
        # route sample left or right based on feature value vs threshold
        if x[node.feature] <= node.threshold:
            return self._traverse(node.left, x)
        else:
            return self._traverse(node.right, x)

    def predict(self, X):
        # iterate over each row of X (each sample), traverse tree, collect predictions
        # np.array wraps the list comprehension result into a NumPy array
        return np.array([self._traverse(self.root, x) for x in X])

Write the NumPy code for Random Forests (Note this assumes the Decision Tree class exists)
Write the NumPy code for Gradient Descent Linear Regression, both training and prediction.
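A minimal sketch minimizing mean squared error with full-batch gradient descent (class name and hyperparameter defaults are my own choices):

```python
import numpy as np

class LinearRegressionGD:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(self.n_iters):
            y_pred = X @ self.w + self.b
            error = y_pred - y
            # gradients of MSE: dL/dw = (2/n) X^T e, dL/db = (2/n) sum(e)
            self.w -= self.lr * (2 / n) * (X.T @ error)
            self.b -= self.lr * (2 / n) * error.sum()

    def predict(self, X):
        return X @ self.w + self.b
```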
Write the NumPy code for Closed Form OLS Linear Regression, both training and prediction.
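A minimal sketch of the normal-equations solution, w = (XᵀX)⁻¹Xᵀy, with a bias column appended. I use np.linalg.lstsq rather than an explicit inverse because it is numerically safer; the class name is my own.

```python
import numpy as np

class LinearRegressionOLS:
    def fit(self, X, y):
        # append a column of ones so the intercept is learned as a weight
        Xb = np.hstack([X, np.ones((len(X), 1))])
        # solves the least-squares problem min ||Xb w - y||^2
        self.w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    def predict(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return Xb @ self.w
```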
Write the NumPy code for GD Logistic Regression, both training and prediction.
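A minimal sketch for binary classification: a sigmoid over a linear model, trained with the gradient of binary cross-entropy (names and defaults are my own):

```python
import numpy as np

class LogisticRegressionGD:
    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    @staticmethod
    def _sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.w + self.b)
            error = p - y
            # gradient of binary cross-entropy: dL/dw = X^T (p - y) / n
            self.w -= self.lr * (X.T @ error) / n
            self.b -= self.lr * error.mean()

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        # threshold probabilities at 0.5 to get class labels
        return (self.predict_proba(X) >= 0.5).astype(int)
```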
Define entropy. Then provide the NumPy implementation.
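One possible answer. Entropy measures the average uncertainty (in bits, when using log base 2) of a node's class distribution: H(y) = -Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i. A pure node scores 0; a 50/50 binary node scores 1. A sketch:

```python
import numpy as np

def entropy(y):
    # class proportions
    p = np.bincount(y) / len(y)
    # drop zero proportions to avoid log2(0)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```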
Write the NumPy implementation of Information Gain using entropy.
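One possible sketch. Information gain is the parent's entropy minus the size-weighted average entropy of the children (function names are my own):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_y, left_y, right_y):
    n = len(parent_y)
    # weighted average child entropy, mirroring the weighted-Gini formula
    weighted_children = (len(left_y) / n) * entropy(left_y) + \
                        (len(right_y) / n) * entropy(right_y)
    return entropy(parent_y) - weighted_children
```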
Write the NumPy code for a GBM implementation (assuming that a DecisionTree class is implemented and available).
Design a data structure that follows the constraints of a Least Recently Used (LRU) cache.
Implement the LRUCache class:
LRUCache(int capacity) Initialize the LRU cache with positive size capacity.
int get(int key) Return the value of the key if the key exists, otherwise return -1.
void put(int key, int value) Update the value of the key if the key exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity from this operation, evict the least recently used key.
The functions get and put must each run in O(1) average time complexity.
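One possible Python answer using collections.OrderedDict, whose move_to_end and popitem are O(1); the classic interview alternative is a hash map plus a hand-rolled doubly linked list.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # ordered oldest -> newest by recency of use

    def get(self, key):
        if key not in self.cache:
            return -1
        # mark as most recently used
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # evict least recently used (front of the ordered dict)
            self.cache.popitem(last=False)
```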
Write the NumPy code for an Embedding Layer.
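One possible sketch: the forward pass is just integer indexing into a weight matrix, and the backward pass scatters the output gradient back into the looked-up rows, using np.add.at so repeated ids accumulate correctly. The class name, init scale, and SGD update are my own choices.

```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # weight matrix: one embed_dim vector per token id
        self.W = rng.normal(0, 0.02, size=(vocab_size, embed_dim))

    def forward(self, ids):
        # lookup is integer (fancy) indexing; cache ids for backward
        self.ids = ids
        return self.W[ids]

    def backward(self, grad_out, lr=0.01):
        # accumulate gradients into the rows that were looked up;
        # np.add.at handles repeated ids correctly (+= would not)
        grad_W = np.zeros_like(self.W)
        np.add.at(grad_W, self.ids, grad_out)
        # plain SGD step
        self.W -= lr * grad_W
```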