NumPy Flashcards
(19 cards)
- Question 1 [5 marks]
```python
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
print(a.ndim, a.shape, a.size)
```
Explain the output of the code above. Additionally, describe how changing the shape of a
using .reshape()
could impact its rank and usability in matrix operations.
- `a.ndim` returns 2 (the number of axes, or dimensions).
- `a.shape` returns (3, 2), indicating 3 rows and 2 columns.
- `a.size` returns 6, the total number of elements.
Reshaping `a` (e.g., to (2, 3)) does not change its data, only the logical layout of the axes, which affects matrix operations such as dot products and broadcasting compatibility.
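A quick self-check of these attributes, sketched with the same array as above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
print(a.ndim, a.shape, a.size)  # 2 (3, 2) 6

# Reshaping keeps all 6 elements but changes the axis layout
b = a.reshape(2, 3)
print(b.shape)  # (2, 3)
```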
- Question 2 [4 marks]
```python
a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 0], [1, 3]])
print(a * b)
print(np.matmul(a, b))
```
a. What is the difference between `a * b` and `np.matmul(a, b)`?
b. Which one performs element-wise multiplication, and which performs matrix multiplication?
- `a * b` performs element-wise multiplication: [[1*2, 2*0], [3*1, 4*3]] = [[2, 0], [3, 12]]
- `np.matmul(a, b)` performs matrix multiplication using linear algebra rules: [[1*2+2*1, 1*0+2*3], [3*2+4*1, 3*0+4*3]] = [[4, 6], [10, 12]]
- Question 3 [4 marks]
Explain what broadcasting is and show how the following operation works:
```python
a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30])
result = a + b
```
Include the intermediate shapes and resulting array.
Broadcasting automatically expands arrays so their shapes are compatible for element-wise operations.
- Shape of `a` is (3, 1); shape of `b` is (3,).
- `b` is treated as shape (1, 3), then both operands are broadcast to (3, 3).
Result:
[[1+10, 1+20, 1+30],
[2+10, 2+20, 2+30],
[3+10, 3+20, 3+30]] = [[11, 21, 31], [12, 22, 32], [13, 23, 33]]
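The broadcast above can be checked directly:

```python
import numpy as np

a = np.array([[1], [2], [3]])  # shape (3, 1)
b = np.array([10, 20, 30])     # shape (3,)

result = a + b                 # both operands broadcast to (3, 3)
print(result.shape)  # (3, 3)
print(result)
```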
- Question 4 [5 marks]
Discuss the purpose and potential pitfalls of the following NumPy methods:
- `np.zeros`
- `np.ones`
- `np.empty`
- `np.full`
When would you prefer one over the others in scientific computing?
- `np.zeros`: Initializes an array filled with zeros. Useful in algorithms where zero is a natural starting point.
- `np.ones`: Similar to `np.zeros`, but initializes with ones. Used when a uniform non-zero starting value is necessary.
- `np.empty`: Creates an uninitialized array. It is faster than `zeros` and `ones`, but the values are unpredictable, so use with caution.
- `np.full`: Initializes an array with a specified value, useful when a constant fill value is needed.
Preference depends on whether uninitialized data (`np.empty`) is acceptable or a specific value (`np.full`) is needed.
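A short sketch contrasting the four constructors (note that `np.empty` contents are arbitrary, so only its shape is meaningful):

```python
import numpy as np

z = np.zeros((2, 3))      # all elements 0.0
o = np.ones((2, 3))       # all elements 1.0
f = np.full((2, 3), 7.5)  # all elements 7.5
e = np.empty((2, 3))      # uninitialized: values are whatever is in memory

print(z.sum(), o.sum(), f[0, 0], e.shape)
```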
- Question 5
What are the key differences between Python lists and NumPy arrays? Use at least three distinct points that affect performance or usability.
Answer:
- Memory Efficiency: NumPy arrays are more memory-efficient than Python lists, especially for large datasets, as they store data in contiguous blocks of memory.
- Performance: NumPy operations are vectorized, allowing for faster computation compared to the iterative nature of Python lists.
- Functionality: NumPy arrays support a wide variety of mathematical operations and broadcasting, which Python lists cannot directly handle.
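To illustrate the vectorization point, here is a hypothetical squaring task done both ways:

```python
import numpy as np

values = [1, 2, 3, 4]
squared_list = [v ** 2 for v in values]  # element-by-element Python loop

arr = np.array(values)
squared_arr = arr ** 2                   # one vectorized operation

print(squared_list)  # [1, 4, 9, 16]
print(squared_arr)   # [ 1  4  9 16]
```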
- Question 6
You are given:
```python
a = np.array([[1, 2], [3, 4]])
print(a[:, 1])
```
Explain what the slicing syntax `a[:, 1]` does, and how it differs from `a[1, :]`.
Answer:
- `a[:, 1]` selects all rows (`:`) of the second column (index `1`), returning a 1D array: [2, 4].
- `a[1, :]` selects the second row (index `1`), returning a 1D array: [3, 4].
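Both slices on the same array, for comparison:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
col = a[:, 1]  # all rows, second column
row = a[1, :]  # second row, all columns

print(col)  # [2 4]
print(row)  # [3 4]
```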
- Question 7 [6 marks]
You are building a machine learning preprocessing pipeline. You want to normalize input features stored in a 2D NumPy array. Write a function that:
- Normalizes each feature column-wise (zero mean, unit variance).
- Handles missing values (NaN) by column mean imputation.
Answer:
```python
def normalize_data(data):
    # Impute NaN with column means
    col_means = np.nanmean(data, axis=0)
    data = np.where(np.isnan(data), col_means, data)
    # Normalize column-wise
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    normalized_data = (data - mean) / std
    return normalized_data
```
This function first imputes missing values with the mean of each column and then normalizes each column.
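A minimal usage sketch (the function is restated so the snippet is self-contained; the sample values are hypothetical):

```python
import numpy as np

def normalize_data(data):
    # Impute NaN with column means
    col_means = np.nanmean(data, axis=0)
    data = np.where(np.isnan(data), col_means, data)
    # Normalize column-wise (zero mean, unit variance)
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)

raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, 30.0]])
clean = normalize_data(raw)
print(clean.mean(axis=0))  # approximately [0. 0.]
print(clean.std(axis=0))   # approximately [1. 1.]
```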
- Question 8 [3 marks]
Given a flattened array of pixel values, describe how you would reshape it into a 2D grayscale image of dimensions (64, 64) using NumPy. What must you check before reshaping?
Answer:
To reshape a flattened array `pixels` of length 4096 (64 * 64) into a 2D image:
```python
image = pixels.reshape(64, 64)
```
Before reshaping, ensure the length of the array matches the required dimensions (64 * 64 = 4096). If not, an error will occur.
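The size check and the reshape together (`pixels` here is a hypothetical placeholder array):

```python
import numpy as np

pixels = np.arange(4096)       # hypothetical flattened pixel data
assert pixels.size == 64 * 64  # verify the element count before reshaping
image = pixels.reshape(64, 64)
print(image.shape)  # (64, 64)
```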
- Question 9 [4 marks]
Describe how NumPy's `axis` argument works using examples with:
- `np.sum()`
- `np.mean()`
Then, explain what would happen if you incorrectly specify the axis in a 2D array operation.
Answer:
- `np.sum(a, axis=0)` sums down the columns of `a`, while `np.sum(a, axis=1)` sums across the rows.
- `np.mean(a, axis=0)` calculates the column-wise mean, while `np.mean(a, axis=1)` calculates the row-wise mean.
Incorrectly specifying an axis might cause incorrect summation or averaging, or result in dimension errors.
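All four cases on one small array:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(a, axis=0))   # column sums:  [ 9 12]
print(np.sum(a, axis=1))   # row sums:     [ 3  7 11]
print(np.mean(a, axis=0))  # column means: [3. 4.]
print(np.mean(a, axis=1))  # row means:    [1.5 3.5 5.5]
```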
- Question 10 [4 marks]
```python
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
print(a + b)
```
Predict the output or describe the error. Why does this operation fail or succeed? How can it be modified to work as intended?
Answer:
This operation will fail because `a` has shape (3,) and `b` has shape (2, 2); the trailing dimensions 3 and 2 differ, so the arrays are not compatible for broadcasting.
To fix it, use an array whose shape is compatible with (2, 2), such as (2,), (2, 1), or (2, 2):
```python
a = np.array([[1, 2], [3, 4]])
print(a + b)
```
- Question 11 [5 marks]
Explain how NumPy's vectorization improves performance compared to regular Python loops. Provide a code example that performs the same operation using both methods and compare execution times.
Answer:
Vectorization in NumPy allows you to perform operations on entire arrays at once, avoiding Python’s slow for-loops.
Example:
```python
# Without vectorization
result = 0
for i in range(1000):
    result += i

# With vectorization
result = np.sum(np.arange(1000))
```
The second method is much faster due to efficient underlying C operations.
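A rough way to compare the two approaches with `timeit` (absolute times depend on the machine, so only equality of results is asserted here):

```python
import timeit
import numpy as np

def loop_sum():
    total = 0
    for i in range(1000):
        total += i
    return total

def vector_sum():
    return np.sum(np.arange(1000))

print(loop_sum(), vector_sum())  # both 499500
print(timeit.timeit(loop_sum, number=1000))    # loop time, machine-dependent
print(timeit.timeit(vector_sum, number=1000))  # vectorized time, machine-dependent
```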
- Question 12 [3 marks]
You are profiling a NumPy-based data processing pipeline and find a performance bottleneck. Suggest three optimization techniques you can apply (besides vectorization) to make your code run faster or use less memory.
Answer:
- Use in-place operations (e.g., `a += b` or `np.add(a, b, out=a)` instead of `a = a + b`) to avoid allocating temporary arrays.
- Use `float32` instead of `float64` for large numerical datasets to reduce memory usage.
- Use `np.memmap()` for handling large arrays that do not fit into memory.
- Question 13 [6 marks]
You are tasked with implementing a data preprocessing pipeline for an image classification model. The images are stored in a 4D NumPy array with shape (num_images, height, width, channels). Write a function to:
- Convert all images to grayscale.
- Normalize each pixel value to the range [0, 1].
- Reshape the images into a 2D array for input to a model.
```python
def preprocess_images(images):
    # Convert images to grayscale by averaging the color channels (assuming RGB)
    grayscale_images = np.mean(images, axis=-1)
    # Normalize pixel values to range [0, 1]
    normalized_images = grayscale_images / 255.0
    # Reshape images to 2D array (num_images, height*width)
    reshaped_images = normalized_images.reshape(normalized_images.shape[0], -1)
    return reshaped_images
```
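A usage sketch with a small hypothetical RGB batch (the function is restated so the example is self-contained):

```python
import numpy as np

def preprocess_images(images):
    # Convert to grayscale by averaging the color channels (assuming RGB last)
    grayscale_images = np.mean(images, axis=-1)
    # Normalize pixel values to range [0, 1]
    normalized_images = grayscale_images / 255.0
    # Reshape to 2D array (num_images, height*width)
    return normalized_images.reshape(normalized_images.shape[0], -1)

# Hypothetical batch: 2 RGB images of size 4x4
batch = np.random.randint(0, 256, size=(2, 4, 4, 3)).astype(float)
out = preprocess_images(batch)
print(out.shape)  # (2, 16)
```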
- Question 14 [5 marks]
Consider a dataset with two features: age and income. Write a function that:
- Standardizes the data (zero mean, unit variance).
- Handles missing values by replacing them with the feature median.
```python
def standardize_data_with_median(data):
    # Handle missing values by replacing NaNs with the median of each column
    medians = np.nanmedian(data, axis=0)
    data = np.where(np.isnan(data), medians, data)
    # Standardize each feature (zero mean, unit variance)
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    standardized_data = (data - mean) / std
    return standardized_data
```
- Question 15 [6 marks]
You are given a dataset of customer spending patterns, represented as a 2D NumPy array where rows correspond to customers and columns correspond to products. Write a function to:
- Apply logarithmic scaling to all values in the dataset (ensuring no zero or negative values).
- Apply min-max scaling to each column to scale values between 0 and 1.
```python
def scale_data(data):
    # Apply logarithmic scaling (avoid log(0) by adding a small constant)
    log_scaled_data = np.log(data + 1e-10)
    # Apply min-max scaling to each column
    min_vals = np.min(log_scaled_data, axis=0)
    max_vals = np.max(log_scaled_data, axis=0)
    min_max_scaled_data = (log_scaled_data - min_vals) / (max_vals - min_vals)
    return min_max_scaled_data
```
- Question 16 [4 marks]
The following code is giving an error. Debug the code and explain what went wrong:
```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[7, 8], [9, 10], [11, 12]])
result = np.dot(a, b, c)
```
What is the issue with the code, and how can you fix it?
The error is in the line `np.dot(a, b, c)`. `np.dot()` takes only two input arrays; a third positional argument is interpreted as the `out` parameter, and `c` is not a valid output array here, so the call raises an error.
Fix: separate the computation into two calls (or use `np.tensordot()` for higher-dimensional products):
```python
result = np.dot(a, b)        # Computes dot product of a and b (a scalar, 32)
result2 = np.dot(result, c)  # Multiplies c by that scalar, if that was intended
```
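For illustration, two valid two-argument calls with these arrays (the `a`-with-`c` product shown here is one possibility, not necessarily what the original author intended):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[7, 8], [9, 10], [11, 12]])

ab = np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
ac = np.dot(a, c)  # (3,) dot (3, 2) -> shape (2,)
print(ab)  # 32
print(ac)  # [58 64]
```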
- Question 17 [4 marks]
You are trying to sum elements of a 2D array along a specific axis but getting incorrect results.
```python
a = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(a, axis=1))
```
Explain what this does and what you expect as the output. Then, predict the output of `np.sum(a, axis=0)`.
- `np.sum(a, axis=1)`: sums the elements along rows (axis 1). The result is the sum of each row: [1+2, 3+4, 5+6] = [3, 7, 11].
- `np.sum(a, axis=0)`: sums the elements along columns (axis 0), i.e., each column: [1+3+5, 2+4+6] = [9, 12].
- Question 18 [5 marks]
Explain the concept of vectorization and how it improves performance over traditional Python loops. Provide an example that sums the elements of an array both using a for-loop and vectorized code.
Vectorization refers to performing operations on entire arrays (or slices of arrays) at once, without using explicit loops. This is much faster in NumPy because it utilizes optimized C code under the hood.
Example:
```python
# Without vectorization (using a for-loop)
arr = np.array([1, 2, 3, 4])
sum_without_vectorization = 0
for num in arr:
    sum_without_vectorization += num

# With vectorization (using NumPy directly)
sum_with_vectorization = np.sum(arr)
```
- Question 19 [5 marks]
You need to optimize NumPy-based code that is processing large datasets and is slow. What strategies would you use to speed up the code and reduce memory usage? List at least three optimization techniques.
1) Use `float32` instead of `float64`: for large datasets, `float32` can significantly reduce memory usage without losing much precision.
```python
data = np.array(large_data, dtype=np.float32)
```
2) Use `np.memmap()` for memory-mapped files: this lets you work with arrays that do not fit into memory by reading from disk only the parts needed.
```python
data = np.memmap('data.dat', dtype='float32', mode='r', shape=large_shape)
```
3) Use in-place operations to avoid allocating new arrays:
```python
a.fill(0)  # In-place operation to fill the array with 0
```
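A small sketch combining the dtype and in-place ideas (`large_data`, `large_shape`, and the file path above are placeholders, so they are not used here):

```python
import numpy as np

a = np.ones(5, dtype=np.float32)  # float32: half the memory of float64
b = np.arange(5, dtype=np.float32)

np.add(a, b, out=a)  # in-place add: result written into a, no new array
print(a)             # [1. 2. 3. 4. 5.]

a.fill(0)            # in-place fill
print(a)             # [0. 0. 0. 0. 0.]
```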