Write the NumPy code for K-Means Clustering
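One possible answer, a minimal sketch of Lloyd's algorithm (the function name, seeding, and empty-cluster fallback are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by sampling k distinct rows of X
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # pairwise distances via broadcasting: shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # assign each sample to its nearest centroid
        labels = np.argmin(dists, axis=1)
        # recompute each centroid as the mean of its assigned samples;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```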
Write the NumPy code for PCA
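One possible answer, a sketch via eigendecomposition of the covariance matrix (SVD of the centered data is an equally valid route; names are my own):

```python
import numpy as np

def pca(X, n_components):
    # center the data so the covariance is computed about the mean
    X_centered = X - X.mean(axis=0)
    # feature covariance matrix: shape (d, d)
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    # sort descending by explained variance, keep the top components
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # project the centered data onto the principal components
    return X_centered @ components, components, eigvals[order]
```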
What is Gini impurity, its formula, and NumPy implementation? Finally, what is the weighted Gini of a split in a decision tree?
Gini impurity measures the probability of misclassifying a randomly chosen sample if labeled according to the node’s class distribution. A pure node scores 0; a maximally mixed binary node scores 0.5.
Formula: G = 1 - Σ pᵢ²
where pᵢ is the proportion of samples belonging to class i.
import numpy as np

def gini(y):
    # compute proportion of each class
    p = np.bincount(y) / len(y)
    # apply formula: 1 - sum(pi^2)
    return 1 - np.sum(p ** 2)

Sanity checks:
gini([0,0,0,0]) → 0.0 (pure node)
gini([0,0,1,1]) → 0.5 (maximally mixed)
The weighted Gini of a split is just:
G_split = (n_left/n) * G_left + (n_right/n) * G_right
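A quick worked example of the weighted-Gini formula (the split values here are made up for illustration):

```python
import numpy as np

def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - np.sum(p ** 2)

left_y = np.array([0, 0, 1])      # G_left  = 1 - (2/3)^2 - (1/3)^2 = 4/9
right_y = np.array([1, 1, 1, 0])  # G_right = 1 - (3/4)^2 - (1/4)^2 = 3/8
n = len(left_y) + len(right_y)
g_split = (len(left_y) / n) * gini(left_y) + (len(right_y) / n) * gini(right_y)
# g_split = (3/7)*(4/9) + (4/7)*(3/8) ≈ 0.4048
```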
You have a NumPy array [1.4, 10.5, 3.5, 2.1]. How can you calculate the thresholds between consecutive values in the array?
Use numpy.unique(values) to get the sorted unique values, then slice the result so each value is paired with its successor (arr[:-1] lines up with arr[1:]). Adding the two slices elementwise and dividing by 2 gives the midpoints. Here’s the code:
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
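Worked through on the array from the question:

```python
import numpy as np

values = np.array([1.4, 10.5, 3.5, 2.1])
# np.unique sorts ascending: [1.4, 2.1, 3.5, 10.5]
thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
# midpoints: 1.75, 2.8, 7.0
```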
Write the NumPy code for Decision Trees.
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None,
                 right=None, value=None):
        self.feature = feature      # index of feature to split on
        self.threshold = threshold  # threshold value to split on
        self.left = left            # left child Node
        self.right = right          # right child Node
        self.value = value          # if leaf, stores majority class

    def is_leaf(self):
        # a node is a leaf if it has a value (i.e. no children)
        return self.value is not None

class DecisionTree:
    def __init__(self, max_depth=10, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None

    def gini(self, y):
        # np.bincount counts occurrences of each class (index=class, value=count)
        # dividing by len(y) converts counts to proportions p_i
        p = np.bincount(y) / len(y)
        # gini = 1 - sum(p_i^2), measures probability of misclassification
        return 1 - np.sum(p ** 2)

    def best_split(self, X, y):
        best_gini = float('inf')
        best_feature, best_threshold = None, None
        # iterate over every feature (column of X)
        for feature in range(X.shape[1]):
            # extract 1D array of all sample values for this feature
            values = X[:, feature]
            # np.unique returns sorted unique values
            # consecutive midpoints avoid empty splits and generalize better
            # arr[:-1] = all but last, arr[1:] = all but first -> consecutive pairs
            # averaging elementwise gives midpoints between consecutive unique values
            thresholds = (np.unique(values)[:-1] + np.unique(values)[1:]) / 2
            for threshold in thresholds:
                # boolean mask selects samples where feature <= threshold (left)
                # and > threshold (right)
                left_y = y[values <= threshold]
                right_y = y[values > threshold]
                len_y = len(y)
                len_left_y = len(left_y)
                len_right_y = len(right_y)
                # skip if either split is empty
                if not len_left_y or not len_right_y:
                    continue
                # weighted gini: weight each child's gini by its proportion of samples
                # G_split = (n_left/n)*G_left + (n_right/n)*G_right
                gini_split = (len_left_y/len_y)*self.gini(left_y) + \
                             (len_right_y/len_y)*self.gini(right_y)
                # greedily track the split with the lowest weighted gini
                if gini_split < best_gini:
                    best_gini = gini_split
                    best_feature = feature
                    best_threshold = threshold
        return best_feature, best_threshold

    def build_tree(self, X, y, depth=0):
        # stopping condition 1: max depth reached
        # stopping condition 2: np.unique(y) returns unique classes;
        #   if only 1 exists, node is pure
        # stopping condition 3: too few samples left to split
        if (depth == self.max_depth or len(np.unique(y)) == 1
                or len(y) < self.min_samples_split):
            # np.bincount(y) counts each class, np.argmax returns the majority class index
            return Node(value=np.argmax(np.bincount(y)))
        best_feature, best_threshold = self.best_split(X, y)
        # stopping condition 4: no valid split found (e.g. all features constant)
        if best_feature is None:
            return Node(value=np.argmax(np.bincount(y)))
        # X[:, best_feature] extracts the best feature column as a 1D array
        values = X[:, best_feature]
        # boolean masks split X and y into left/right subsets simultaneously
        left_X, left_y = X[values <= best_threshold], y[values <= best_threshold]
        right_X, right_y = X[values > best_threshold], y[values > best_threshold]
        # recurse on each subset, incrementing depth
        left_child = self.build_tree(left_X, left_y, depth + 1)
        right_child = self.build_tree(right_X, right_y, depth + 1)
        # return internal node storing split info and children
        return Node(feature=best_feature, threshold=best_threshold,
                    left=left_child, right=right_child)

    def fit(self, X, y):
        self.root = self.build_tree(X, y)

    def _traverse(self, node, x):
        # base case: leaf node, return its majority class value
        if node.is_leaf():
            return node.value
        # route sample left or right based on feature value vs threshold
        if x[node.feature] <= node.threshold:
            return self._traverse(node.left, x)
        else:
            return self._traverse(node.right, x)

    def predict(self, X):
        # iterate over each row of X (each sample), traverse tree, collect predictions
        # np.array wraps the list comprehension result into a NumPy array
        return np.array([self._traverse(self.root, x) for x in X])

Write the NumPy code for Random Forests (Note this assumes the Decision Tree class exists)
Write the NumPy code for Gradient Descent Linear Regression, both training and prediction.
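A minimal sketch minimizing mean squared error with full-batch gradient descent (class name and hyperparameter defaults are my own choices):

```python
import numpy as np

class LinearRegressionGD:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(self.n_iters):
            y_pred = X @ self.w + self.b
            error = y_pred - y
            # gradients of MSE: dL/dw = (2/n) X^T e, dL/db = (2/n) sum(e)
            self.w -= self.lr * (2 / n) * (X.T @ error)
            self.b -= self.lr * (2 / n) * error.sum()

    def predict(self, X):
        return X @ self.w + self.b
```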
Write the NumPy code for Closed Form OLS Linear Regression, both training and prediction.
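A minimal sketch of the normal-equations solution, w = (XᵀX)⁻¹Xᵀy, with a bias column appended. I use np.linalg.lstsq rather than an explicit inverse because it is numerically safer; the class name is my own.

```python
import numpy as np

class LinearRegressionOLS:
    def fit(self, X, y):
        # append a column of ones so the intercept is learned as a weight
        Xb = np.hstack([X, np.ones((len(X), 1))])
        # solves the least-squares problem min ||Xb w - y||^2
        self.w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    def predict(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return Xb @ self.w
```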
Write the NumPy code for GD Logistic Regression, both training and prediction.
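A minimal sketch for binary classification: a sigmoid over a linear model, trained with the gradient of binary cross-entropy (names and defaults are my own):

```python
import numpy as np

class LogisticRegressionGD:
    def __init__(self, lr=0.1, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters

    @staticmethod
    def _sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        self.b = 0.0
        for _ in range(self.n_iters):
            p = self._sigmoid(X @ self.w + self.b)
            error = p - y
            # gradient of binary cross-entropy: dL/dw = X^T (p - y) / n
            self.w -= self.lr * (X.T @ error) / n
            self.b -= self.lr * error.mean()

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X):
        # threshold probabilities at 0.5 to get class labels
        return (self.predict_proba(X) >= 0.5).astype(int)
```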
Define entropy. Then provide the NumPy implementation.
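One possible answer. Entropy measures the average uncertainty (in bits, when using log base 2) of a node's class distribution: H(y) = -Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of class i. A pure node scores 0; a 50/50 binary node scores 1. A sketch:

```python
import numpy as np

def entropy(y):
    # class proportions
    p = np.bincount(y) / len(y)
    # drop zero proportions to avoid log2(0)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```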
Write the NumPy implementation of Information Gain using entropy.
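One possible sketch. Information gain is the parent's entropy minus the size-weighted average entropy of the children (function names are my own):

```python
import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_y, left_y, right_y):
    n = len(parent_y)
    # weighted average child entropy, mirroring the weighted-Gini formula
    weighted_children = (len(left_y) / n) * entropy(left_y) + \
                        (len(right_y) / n) * entropy(right_y)
    return entropy(parent_y) - weighted_children
```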
Write the NumPy code for a GBM implementation (assuming that a DecisionTree class is implemented and available).
Design a data structure that follows the constraints of a Least Recently Used (LRU) cache.
Implement the LRUCache class:
LRUCache(int capacity) Initialize the LRU cache with positive size capacity.
int get(int key) Return the value of the key if the key exists, otherwise return -1.
void put(int key, int value) Update the value of the key if the key exists. Otherwise, add the key-value pair to the cache. If the number of keys exceeds the capacity from this operation, evict the least recently used key.
The functions get and put must each run in O(1) average time complexity.
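One possible Python answer using collections.OrderedDict, whose move_to_end and popitem are O(1); the classic interview alternative is a hash map plus a hand-rolled doubly linked list.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # ordered oldest -> newest by recency of use

    def get(self, key):
        if key not in self.cache:
            return -1
        # mark as most recently used
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            # evict least recently used (front of the ordered dict)
            self.cache.popitem(last=False)
```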
Write the NumPy code for an Embedding Layer.
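One possible sketch: the forward pass is just integer indexing into a weight matrix, and the backward pass scatters the output gradient back into the looked-up rows, using np.add.at so repeated ids accumulate correctly. The class name, init scale, and SGD update are my own choices.

```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        # weight matrix: one embed_dim vector per token id
        self.W = rng.normal(0, 0.02, size=(vocab_size, embed_dim))

    def forward(self, ids):
        # lookup is integer (fancy) indexing; cache ids for backward
        self.ids = ids
        return self.W[ids]

    def backward(self, grad_out, lr=0.01):
        # accumulate gradients into the rows that were looked up;
        # np.add.at handles repeated ids correctly (+= would not)
        grad_W = np.zeros_like(self.W)
        np.add.at(grad_W, self.ids, grad_out)
        # plain SGD step
        self.W -= lr * grad_W
```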