NumPy Flashcards
(19 cards)
- Question 1 [5 marks]
```python
import numpy as np
a = np.array([[1, 2], [3, 4], [5, 6]])
print(a.ndim, a.shape, a.size)
```
Explain the output of the code above. Additionally, describe how changing the shape of a
using .reshape()
could impact its rank and usability in matrix operations.
- `a.ndim` returns 2 (the number of axes, or dimensions).
- `a.shape` returns (3, 2), indicating 3 rows and 2 columns.
- `a.size` returns 6, the total number of elements.
Reshaping `a` (e.g., to (2, 3)) does not change its data, only the logical layout of the axes, which affects matrix operations such as dot products and broadcasting compatibility.
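A quick self-check of these attributes, sketched with the same array as above:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
print(a.ndim, a.shape, a.size)  # 2 (3, 2) 6

# Reshaping keeps all 6 elements but changes the axis layout
b = a.reshape(2, 3)
print(b.shape)  # (2, 3)
```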
- Question 2 [4 marks]
```python
a = np.array([[1, 2], [3, 4]])
b = np.array([[2, 0], [1, 3]])
print(a * b)
print(np.matmul(a, b))
```
a. What is the difference between `a * b` and `np.matmul(a, b)`?
b. Which one performs element-wise multiplication, and which performs matrix multiplication?
- `a * b` performs element-wise multiplication: [[1*2, 2*0], [3*1, 4*3]] = [[2, 0], [3, 12]]
- `np.matmul(a, b)` performs matrix multiplication using linear algebra rules: [[1*2+2*1, 1*0+2*3], [3*2+4*1, 3*0+4*3]] = [[4, 6], [10, 12]]
- Question 3 [4 marks]
Explain what broadcasting is and show how the following operation works:
```python
a = np.array([[1], [2], [3]])
b = np.array([10, 20, 30])
result = a + b
```
Include the intermediate shapes and resulting array.
Broadcasting automatically expands arrays so their shapes are compatible for element-wise operations.
- Shape of `a` is (3, 1); shape of `b` is (3,).
- `b` is treated as shape (1, 3), then both operands are broadcast to (3, 3).
Result:
[[1+10, 1+20, 1+30],
[2+10, 2+20, 2+30],
[3+10, 3+20, 3+30]] = [[11, 21, 31], [12, 22, 32], [13, 23, 33]]
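The broadcast above can be checked directly:

```python
import numpy as np

a = np.array([[1], [2], [3]])  # shape (3, 1)
b = np.array([10, 20, 30])     # shape (3,)

result = a + b                 # both operands broadcast to (3, 3)
print(result.shape)  # (3, 3)
print(result)
```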
- Question 4 [5 marks]
Discuss the purpose and potential pitfalls of the following NumPy methods:
- `np.zeros`
- `np.ones`
- `np.empty`
- `np.full`
When would you prefer one over the others in scientific computing?
- `np.zeros`: Initializes an array filled with zeros. Useful in algorithms where zero is a natural starting point.
- `np.ones`: Similar to `np.zeros`, but initializes with ones. Used when a uniform non-zero starting value is necessary.
- `np.empty`: Creates an uninitialized array. It is faster than `zeros` and `ones`, but the values are unpredictable, so use with caution.
- `np.full`: Initializes an array with a specified value, useful when a constant fill value is needed.
Preference depends on whether uninitialized data (`np.empty`) is acceptable or a specific value (`np.full`) is needed.
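A short sketch contrasting the four constructors (note that `np.empty` contents are arbitrary, so only its shape is meaningful):

```python
import numpy as np

z = np.zeros((2, 3))      # all elements 0.0
o = np.ones((2, 3))       # all elements 1.0
f = np.full((2, 3), 7.5)  # all elements 7.5
e = np.empty((2, 3))      # uninitialized: values are whatever is in memory

print(z.sum(), o.sum(), f[0, 0], e.shape)
```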
- Question 5
What are the key differences between Python lists and NumPy arrays? Use at least three distinct points that affect performance or usability.
Answer:
- Memory Efficiency: NumPy arrays are more memory-efficient than Python lists, especially for large datasets, as they store data in contiguous blocks of memory.
- Performance: NumPy operations are vectorized, allowing for faster computation compared to the iterative nature of Python lists.
- Functionality: NumPy arrays support a wide variety of mathematical operations and broadcasting, which Python lists cannot directly handle.
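To illustrate the vectorization point, here is a hypothetical squaring task done both ways:

```python
import numpy as np

values = [1, 2, 3, 4]
squared_list = [v ** 2 for v in values]  # element-by-element Python loop

arr = np.array(values)
squared_arr = arr ** 2                   # one vectorized operation

print(squared_list)  # [1, 4, 9, 16]
print(squared_arr)   # [ 1  4  9 16]
```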
- Question 6
You are given:
```python
a = np.array([[1, 2], [3, 4]])
print(a[:, 1])
```
Explain what the slicing syntax `a[:, 1]` does, and how it differs from `a[1, :]`.
Answer:
- `a[:, 1]` selects all rows (`:`) of the second column (index `1`), returning a 1D array: [2, 4].
- `a[1, :]` selects the second row (index `1`), returning a 1D array: [3, 4].
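Both slices on the same array, for comparison:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
col = a[:, 1]  # all rows, second column
row = a[1, :]  # second row, all columns

print(col)  # [2 4]
print(row)  # [3 4]
```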
- Question 7 [6 marks]
You are building a machine learning preprocessing pipeline. You want to normalize input features stored in a 2D NumPy array. Write a function that:
- Normalizes each feature column-wise (zero mean, unit variance).
- Handles missing values (NaN) by column mean imputation.
Answer:
```python
def normalize_data(data):
    # Impute NaN with column means
    col_means = np.nanmean(data, axis=0)
    data = np.where(np.isnan(data), col_means, data)
    # Normalize column-wise
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    normalized_data = (data - mean) / std
    return normalized_data
```
This function first imputes missing values with the mean of each column and then normalizes each column.
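A minimal usage sketch (the function is restated so the snippet is self-contained; the sample values are hypothetical):

```python
import numpy as np

def normalize_data(data):
    # Impute NaN with column means
    col_means = np.nanmean(data, axis=0)
    data = np.where(np.isnan(data), col_means, data)
    # Normalize column-wise (zero mean, unit variance)
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)

raw = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, 30.0]])
clean = normalize_data(raw)
print(clean.mean(axis=0))  # approximately [0. 0.]
print(clean.std(axis=0))   # approximately [1. 1.]
```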
- Question 8 [3 marks]
Given a flattened array of pixel values, describe how you would reshape it into a 2D grayscale image of dimensions (64, 64) using NumPy. What must you check before reshaping?
Answer:
To reshape a flattened array `pixels` of length 4096 (64 * 64) into a 2D image:
```python
image = pixels.reshape(64, 64)
```
Before reshaping, ensure the length of the array matches the required dimensions (64 * 64 = 4096). If not, an error will occur.
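The size check and the reshape together (`pixels` here is a hypothetical placeholder array):

```python
import numpy as np

pixels = np.arange(4096)       # hypothetical flattened pixel data
assert pixels.size == 64 * 64  # verify the element count before reshaping
image = pixels.reshape(64, 64)
print(image.shape)  # (64, 64)
```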
- Question 9 [4 marks]
Describe how NumPy's `axis` argument works using examples with:
- `np.sum()`
- `np.mean()`
Then, explain what would happen if you incorrectly specify the axis in a 2D array operation.
Answer:
- `np.sum(a, axis=0)` sums down the columns of `a`, while `np.sum(a, axis=1)` sums across the rows.
- `np.mean(a, axis=0)` calculates the column-wise mean, while `np.mean(a, axis=1)` calculates the row-wise mean.
Incorrectly specifying an axis might cause incorrect summation or averaging, or result in dimension errors.
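All four cases on one small array:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(a, axis=0))   # column sums:  [ 9 12]
print(np.sum(a, axis=1))   # row sums:     [ 3  7 11]
print(np.mean(a, axis=0))  # column means: [3. 4.]
print(np.mean(a, axis=1))  # row means:    [1.5 3.5 5.5]
```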
- Question 10 [4 marks]
```python
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
print(a + b)
```
Predict the output or describe the error. Why does this operation fail or succeed? How can it be modified to work as intended?
Answer:
This operation will fail because `a` has shape (3,) and `b` has shape (2, 2); the trailing dimensions 3 and 2 differ, so the arrays are not compatible for broadcasting.
To fix it, use an array whose shape is compatible with (2, 2), such as (2,), (2, 1), or (2, 2):
```python
a = np.array([[1, 2], [3, 4]])
print(a + b)
```
- Question 11 [5 marks]
Explain how NumPy's vectorization improves performance compared to regular Python loops. Provide a code example that performs the same operation using both methods and compare execution times.
Answer:
Vectorization in NumPy allows you to perform operations on entire arrays at once, avoiding Python’s slow for-loops.
Example:
```python
# Without vectorization
result = 0
for i in range(1000):
    result += i

# With vectorization
result = np.sum(np.arange(1000))
```
The second method is much faster due to efficient underlying C operations.
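A rough way to compare the two approaches with `timeit` (absolute times depend on the machine, so only equality of results is asserted here):

```python
import timeit
import numpy as np

def loop_sum():
    total = 0
    for i in range(1000):
        total += i
    return total

def vector_sum():
    return np.sum(np.arange(1000))

print(loop_sum(), vector_sum())  # both 499500
print(timeit.timeit(loop_sum, number=1000))    # loop time, machine-dependent
print(timeit.timeit(vector_sum, number=1000))  # vectorized time, machine-dependent
```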
- Question 12 [3 marks]
You are profiling a NumPy-based data processing pipeline and find a performance bottleneck. Suggest three optimization techniques you can apply (besides vectorization) to make your code run faster or use less memory.
Answer:
- Use in-place operations (e.g., `a += b` or `np.add(a, b, out=a)` instead of `a = a + b`) to avoid allocating temporary arrays.
- Use `float32` instead of `float64` for large numerical datasets to reduce memory usage.
- Use `np.memmap()` for handling large arrays that do not fit into memory.
- Question 13 [6 marks]
You are tasked with implementing a data preprocessing pipeline for an image classification model. The images are stored in a 4D NumPy array with shape (num_images, height, width, channels). Write a function to:
- Convert all images to grayscale.
- Normalize each pixel value to the range [0, 1].
- Reshape the images into a 2D array for input to a model.
```python
def preprocess_images(images):
    # Convert images to grayscale by averaging the color channels (assuming RGB)
    grayscale_images = np.mean(images, axis=-1)
    # Normalize pixel values to range [0, 1]
    normalized_images = grayscale_images / 255.0
    # Reshape images to 2D array (num_images, height*width)
    reshaped_images = normalized_images.reshape(normalized_images.shape[0], -1)
    return reshaped_images
```
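A usage sketch with a small hypothetical RGB batch (the function is restated so the example is self-contained):

```python
import numpy as np

def preprocess_images(images):
    # Convert to grayscale by averaging the color channels (assuming RGB last)
    grayscale_images = np.mean(images, axis=-1)
    # Normalize pixel values to range [0, 1]
    normalized_images = grayscale_images / 255.0
    # Reshape to 2D array (num_images, height*width)
    return normalized_images.reshape(normalized_images.shape[0], -1)

# Hypothetical batch: 2 RGB images of size 4x4
batch = np.random.randint(0, 256, size=(2, 4, 4, 3)).astype(float)
out = preprocess_images(batch)
print(out.shape)  # (2, 16)
```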
- Question 14 [5 marks]
Consider a dataset with two features: age and income. Write a function that:
- Standardizes the data (zero mean, unit variance).
- Handles missing values by replacing them with the feature median.
```python
def standardize_data_with_median(data):
    # Handle missing values by replacing NaNs with the median of each column
    medians = np.nanmedian(data, axis=0)
    data = np.where(np.isnan(data), medians, data)
    # Standardize each feature (zero mean, unit variance)
    mean = np.mean(data, axis=0)
    std = np.std(data, axis=0)
    standardized_data = (data - mean) / std
    return standardized_data
```
- Question 15 [6 marks]
You are given a dataset of customer spending patterns, represented as a 2D NumPy array where rows correspond to customers and columns correspond to products. Write a function to:
- Apply logarithmic scaling to all values in the dataset (ensuring no zero or negative values).
- Apply min-max scaling to each column to scale values between 0 and 1.
```python
def scale_data(data):
    # Apply logarithmic scaling (avoid log(0) by adding a small constant)
    log_scaled_data = np.log(data + 1e-10)
    # Apply min-max scaling to each column
    min_vals = np.min(log_scaled_data, axis=0)
    max_vals = np.max(log_scaled_data, axis=0)
    min_max_scaled_data = (log_scaled_data - min_vals) / (max_vals - min_vals)
    return min_max_scaled_data
```
- Question 16 [4 marks]
The following code is giving an error. Debug the code and explain what went wrong:
```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[7, 8], [9, 10], [11, 12]])
result = np.dot(a, b, c)
```
What is the issue with the code, and how can you fix it?
The error is in the line `np.dot(a, b, c)`. `np.dot()` takes only two input arrays; a third positional argument is interpreted as the `out` parameter, and `c` is not a valid output array here, so the call raises an error.
Fix: separate the computation into two calls (or use `np.tensordot()` for higher-dimensional products):
```python
result = np.dot(a, b)        # Computes dot product of a and b (a scalar, 32)
result2 = np.dot(result, c)  # Multiplies c by that scalar, if that was intended
```
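For illustration, two valid two-argument calls with these arrays (the `a`-with-`c` product shown here is one possibility, not necessarily what the original author intended):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[7, 8], [9, 10], [11, 12]])

ab = np.dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
ac = np.dot(a, c)  # (3,) dot (3, 2) -> shape (2,)
print(ab)  # 32
print(ac)  # [58 64]
```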
- Question 17 [4 marks]
You are trying to sum elements of a 2D array along a specific axis but getting incorrect results.
```python
a = np.array([[1, 2], [3, 4], [5, 6]])
print(np.sum(a, axis=1))
```
Explain what this does and what you expect as the output. Then, predict the output of `np.sum(a, axis=0)`.
- `np.sum(a, axis=1)`: sums the elements along rows (axis 1). The result is the sum of each row: [1+2, 3+4, 5+6] = [3, 7, 11].
- `np.sum(a, axis=0)`: sums the elements along columns (axis 0), i.e., each column: [1+3+5, 2+4+6] = [9, 12].
- Question 18 [5 marks]
Explain the concept of vectorization and how it improves performance over traditional Python loops. Provide an example that sums the elements of an array both using a for-loop and vectorized code.
Vectorization refers to performing operations on entire arrays (or slices of arrays) at once, without using explicit loops. This is much faster in NumPy because it utilizes optimized C code under the hood.
Example:
```python
# Without vectorization (using a for-loop)
arr = np.array([1, 2, 3, 4])
sum_without_vectorization = 0
for num in arr:
    sum_without_vectorization += num

# With vectorization (using NumPy directly)
sum_with_vectorization = np.sum(arr)
```
- Question 19 [5 marks]
You need to optimize NumPy-based code that is processing large datasets and is slow. What strategies would you use to speed up the code and reduce memory usage? List at least three optimization techniques.
1) Use `float32` instead of `float64`: for large datasets, `float32` can significantly reduce memory usage without losing much precision.
```python
data = np.array(large_data, dtype=np.float32)
```
2) Use `np.memmap()` for memory-mapped files: this lets you work with arrays that do not fit into memory by reading from disk only the parts needed.
```python
data = np.memmap('data.dat', dtype='float32', mode='r', shape=large_shape)
```
3) Use in-place operations to avoid allocating new arrays:
```python
a.fill(0)  # In-place operation to fill the array with 0
```
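A small sketch combining the dtype and in-place ideas (`large_data`, `large_shape`, and the file path above are placeholders, so they are not used here):

```python
import numpy as np

a = np.ones(5, dtype=np.float32)  # float32: half the memory of float64
b = np.arange(5, dtype=np.float32)

np.add(a, b, out=a)  # in-place add: result written into a, no new array
print(a)             # [1. 2. 3. 4. 5.]

a.fill(0)            # in-place fill
print(a)             # [0. 0. 0. 0. 0.]
```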