Medical imaging Flashcards

(40 cards)

1
Q

Which imaging modalities use ionizing radiation and which do not, and why is this distinction clinically important?

A

Ionizing Radiation Modalities:
X-rays (plain radiographs)
CT (Computed Tomography)
Nuclear Medicine scans (e.g., PET, SPECT): these use radioactive tracers that emit gamma rays.

Non-Ionizing Radiation Modalities:
MRI (Magnetic Resonance Imaging): uses magnetic fields and radiofrequency waves.
Ultrasound: uses high-frequency sound waves.

Clinical Importance:
Ionizing radiation can damage biological tissues and DNA if exposure is high or frequent, potentially increasing cancer risk. Therefore, dose minimization is crucial.

Non-ionizing imaging (MRI/Ultrasound) is generally safer regarding radiation concerns but might have other limitations (e.g., cost, availability, scanning time).

2
Q

Explain how CT scanning reconstructs cross-sectional slices from multiple X-ray projections, and why it provides better soft-tissue contrast than a conventional X-ray.

A

Reconstruction Principle:
- In CT, the X-ray source and detector rotate 360° around the patient. Multiple 1D projections are acquired at many angles.

  • A reconstruction algorithm (often Filtered Back-Projection or modern iterative methods) mathematically reconstructs a 2D slice from these projections.

Better Soft-Tissue Contrast:
- Conventional X-rays compress all tissues into a single 2D projection, so overlapping structures can hide subtle differences.
- CT provides cross-sectional slices, reducing overlap. Tiny differences in attenuation (density) become more apparent in the slice, yielding improved soft-tissue discrimination.

3
Q

What is the role of radioisotopes in PET (Positron Emission Tomography), and how does it fundamentally differ from MRI in terms of the information it provides?

A

Role of Radioisotopes in PET:

  • A tracer (e.g., FDG, a glucose analog labeled with ^18F) is injected into the patient. Cancer cells or metabolically active tissues take up more tracer.
  • When the isotope decays, it emits positrons which annihilate with electrons, producing gamma photons detected by the PET scanner.

Difference from MRI:

  • PET: Provides functional or metabolic information (e.g., how actively a region uses glucose).
  • MRI: Provides anatomical and some functional details (e.g., T1/T2-weighted contrasts or fMRI for blood oxygen level) but not direct metabolic uptake images. MRI relies on magnetic properties of hydrogen nuclei in tissues.
4
Q

Define bit depth in digital images and discuss how differences in bit depth (e.g., 8-bit vs. 16-bit) impact possible intensity values.

A

Bit Depth: The number of bits used to represent each pixel’s intensity.
- 8-bit: 2^8 = 256 discrete intensity levels.
- 16-bit: 2^16 = 65,536 levels.

Impact:
A higher bit depth allows a wider range of intensities and finer gradations (useful in modalities like CT or some microscopy).
8-bit images may risk losing subtle intensity differences, whereas 16-bit retains small gradations but often requires special viewing/processing for display.
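
A minimal sketch (assuming NumPy) of how bit depth maps to the representable intensity range, and how a naive 16-bit → 8-bit conversion collapses gradations:

```python
import numpy as np

print(np.iinfo(np.uint8).max)    # 255    -> 2**8 levels (0..255)
print(np.iinfo(np.uint16).max)   # 65535  -> 2**16 levels (0..65535)

# Naive down-conversion: dividing by 257 maps 0..65535 onto 0..255,
# so many distinct 16-bit values land on the same 8-bit value.
img16 = np.linspace(0, 65535, 1000).astype(np.uint16)
img8 = (img16 / 257).astype(np.uint8)
print(np.unique(img16).size, np.unique(img8).size)   # e.g. 1000 vs. 256
```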

5
Q

What is DICOM, and why is it crucial for the interoperability of medical imaging devices?

A

DICOM (Digital Imaging and Communications in Medicine):
- A standardized format and network protocol for storing/transmitting medical images.
- Encodes not only pixel data but also metadata (patient ID, study date, modality parameters, etc.).

Importance:
Ensures interoperability across different scanners, PACS (Picture Archiving and Communication Systems), and software.
Allows hospitals and clinics to exchange imaging studies reliably, enabling consistent workflows worldwide.
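
A minimal sketch of reading a DICOM file, assuming the pydicom library and a hypothetical file path:

```python
import pydicom

ds = pydicom.dcmread("study/slice_001.dcm")      # hypothetical path
print(ds.PatientID, ds.Modality, ds.StudyDate)   # metadata travels with the pixels
pixels = ds.pixel_array                          # NumPy array of the stored pixel data
print(pixels.shape, pixels.dtype)
```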

6
Q

Explain the Nyquist–Shannon sampling theorem in simple terms, and how violating it causes aliasing artifacts in images.

A

Sampling Theorem: A signal must be sampled at a rate at least twice its highest frequency component (the Nyquist rate) to capture all information without losing detail.

Violation → Aliasing:
If sampling is too sparse, high-frequency details appear as misleading lower-frequency patterns (e.g., moiré).
In images, small repetitive structures or sharp edges can be incorrectly rendered, creating artifacts or patterns that weren’t actually there.
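
A minimal sketch (assuming NumPy) of aliasing in 1D: a 9 Hz sine sampled at 10 samples/s (below the 18 samples/s Nyquist rate) is indistinguishable from its 1 Hz alias.

```python
import numpy as np

t = np.arange(0, 1, 0.1)                    # 10 samples per second
sampled_9hz = np.sin(2 * np.pi * 9 * t)
alias_1hz = np.sin(2 * np.pi * -1 * t)      # the -1 Hz (i.e. 1 Hz) alias
print(np.allclose(sampled_9hz, alias_1hz))  # True: the samples cannot tell them apart
```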

7
Q

What is a histogram in image processing, and how do you use it for simple thresholding of an image (e.g., Otsu’s method)?

A

Histogram: A distribution of the pixel intensities. For grayscale images, the x-axis is intensity and y-axis is the count (or probability) of pixels at each intensity.

Thresholding:

E.g., Otsu’s Method: Automatically finds an intensity threshold that separates foreground and background by maximizing between-class variance (or minimizing within-class variance).

Implementation typically calculates for each candidate threshold how well it separates the histogram into two clusters, and picks the best threshold.
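
A minimal sketch (assuming a NumPy uint8 grayscale image) of that candidate-threshold search, maximizing between-class variance:

```python
import numpy as np

def otsu_threshold(img):
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # intensity probabilities
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Usage: mask = img >= otsu_threshold(img)
```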

8
Q

How does linear contrast stretching work, and why do we sometimes apply it to 16-bit images before visualization?

A

Linear Contrast Stretching:

Maps an input range [I_min, I_max] to a new dynamic range [0, 255] (if 8-bit output).

Formula: I_out = α (I_in − I_min), where
α = 255 / (I_max − I_min)

Reason for 16-bit → 8-bit:

16-bit images hold many intensity levels, but typical monitors display only 8-bit. A linear stretch ensures the visible range (0–255) best represents the relevant intensities.

This reveals subtle differences otherwise hidden in the extended range of the 16-bit image.
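
A minimal sketch (assuming NumPy) of the stretch described above, taking a 16-bit image to 0–255 for display:

```python
import numpy as np

def stretch_to_8bit(img16):
    i_min, i_max = float(img16.min()), float(img16.max())
    alpha = 255.0 / (i_max - i_min)               # scale factor from the formula above
    out = alpha * (img16.astype(np.float64) - i_min)
    return np.clip(out, 0, 255).astype(np.uint8)
```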

9
Q

Outline the basic formula for 2D convolution and explain how kernel flipping differs between convolution and cross-correlation.

A

2D Convolution:

g(x, y) = (f ∗ h)(x, y) = Σ_{m=−k}^{k} Σ_{n=−k}^{k} f(x − m, y − n) h(m, n)

Kernel Flipping:
True mathematical convolution flips the kernel in both x and y directions (i.e., h(−m,−n)).

Cross-correlation omits the flip, effectively using h(m,n) as is.

Many image processing libraries do cross-correlation by default because it’s simpler for template matching.
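
A minimal sketch (assuming SciPy) showing that convolution equals cross-correlation with a kernel flipped in both axes:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

f = np.arange(25, dtype=float).reshape(5, 5)                 # toy "image"
h = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])     # asymmetric toy kernel

conv = convolve2d(f, h, mode="same")
corr_with_flipped = correlate2d(f, h[::-1, ::-1], mode="same")
print(np.allclose(conv, corr_with_flipped))                  # True
```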

10
Q

Compare the box (mean) filter with a Gaussian filter in terms of smoothing performance and side effects on edges.

A

Box (Mean) Filter:

Simple average of the neighborhood.

Strongly blurs edges (noisy corners are significantly smoothed).

Can introduce block-like artifacts because each pixel is equally weighted in the local window.

Gaussian Filter:

Weights pixels according to a Gaussian distribution (center > edges of the window).

More natural smoothing, less likely to create abrupt artifacts.

Preserves edges slightly better than box filter because it places more emphasis on central pixels.

Also used in many scale-space techniques (LoG, DoG, etc.).
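
A minimal sketch (assuming SciPy) smoothing a noisy step edge with both filters:

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = 1.0                                  # vertical step edge
img += rng.normal(scale=0.2, size=img.shape)       # additive noise

box_smoothed = uniform_filter(img, size=5)         # every pixel in the 5x5 window weighted equally
gauss_smoothed = gaussian_filter(img, sigma=1.5)   # weights fall off away from the window centre
```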

11
Q

What is the Sobel operator, and how do its horizontal and vertical kernels detect edge directions?

A

Sobel Operator:

A pair of 3×3 filters used to approximate the intensity gradient in horizontal (x) and vertical (y) directions.

Typically:

Gx = [−1 0 1; −2 0 2; −1 0 1]
Gy = [−1 −2 −1; 0 0 0; 1 2 1]

Edge Direction Detection:

Convolving with Gx captures changes along x (vertical edges).

Convolving with Gy captures changes along y (horizontal edges).

The gradient magnitude sqrt(Gx² + Gy²) indicates edge strength, and arctan(Gy/Gx) indicates direction.
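
A minimal sketch (assuming SciPy) computing Sobel gradients, magnitude, and direction:

```python
import numpy as np
from scipy.ndimage import sobel

def sobel_edges(img):
    # img: 2D float array (assumed)
    gx = sobel(img, axis=1)           # derivative along x -> responds to vertical edges
    gy = sobel(img, axis=0)           # derivative along y -> responds to horizontal edges
    magnitude = np.hypot(gx, gy)      # edge strength
    direction = np.arctan2(gy, gx)    # edge orientation in radians
    return magnitude, direction
```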

12
Q

Describe how the Laplacian-of-Gaussian filter (LoG) is used for blob detection and why scale selection is crucial.

A

LoG Filter:

∇²(G_σ ∗ I): a second-derivative-based operator applied to a Gaussian-smoothed image.

Detects regions where intensity changes from bright to dark or vice versa in a “blob”-like manner.

Blob Detection:

A circular (or elliptical) region can be identified if LoG response is high and changes sign near the center.

By examining multiple σ values, one can detect blobs of different sizes (multiscale analysis).

Scale Selection:

Real-world objects come in different sizes. A single σ might only detect certain sized blobs.

Using a range of scales ensures capturing small, medium, or large structures.
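
A minimal sketch of multiscale LoG blob detection, assuming scikit-image and a grayscale float image:

```python
import numpy as np
from skimage.feature import blob_log

def detect_blobs(img):
    # img: 2D grayscale array, roughly in [0, 1] (assumed)
    # Each returned row is (y, x, sigma); blob radius is approximately sigma * sqrt(2).
    blobs = blob_log(img, min_sigma=2, max_sigma=15, num_sigma=10, threshold=0.1)
    blobs[:, 2] *= np.sqrt(2)
    return blobs
```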

13
Q

Briefly describe the four main steps of SIFT (Scale-Space Extrema, Keypoint Localization, Orientation Assignment, Descriptor Formation).

A

Scale-Space Extrema:

Build a Gaussian pyramid and compute Difference of Gaussians (DoG) at multiple scales.

Identify local maxima/minima in a 3D neighborhood (x, y, scale).

Keypoint Localization:

Refine each candidate’s location and scale.

Reject low-contrast points or points on edges (via a Hessian-based check).

Orientation Assignment:

Compute local gradient orientations in a region around the keypoint.

Assign a dominant orientation (or multiple if they are within 80% of the peak).

Descriptor Formation:

For a 16×16 window around the keypoint (rotated to the assigned orientation), form 4×4 subregions.

Each subregion accumulates an 8-bin gradient histogram → 4×4×8 = 128D descriptor vector.
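
A minimal sketch of running SIFT end to end, assuming OpenCV 4.4+ (where SIFT_create is available) and a hypothetical file name; the four steps above all happen inside detectAndCompute:

```python
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries (x, y), scale, and orientation; each descriptor is a 128-D vector.
```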

14
Q

Why is Difference of Gaussians (DoG) used as an approximation to the Laplacian in SIFT, and how does it help identify scale-invariant keypoints?

A

DoG Approximation:

The Laplacian-of-Gaussian can be computed by subtracting two Gaussians at nearby scales (Gauss(σ) - Gauss(κσ)).

This is computationally more efficient and stable than directly convolving with LoG.

Scale Invariance:

By building a pyramid of images blurred at different scales, local maxima in DoG indicate potential blobs (keypoints).

The scale at which the response is maximal corresponds to the intrinsic scale of the feature, making detection invariant to image size changes.

15
Q

In the context of SIFT, what do we mean by rotation invariance, and how is it achieved in practice?

A

Rotation Invariance:

Keypoints should be matched even if the object appears rotated in another image.

Implementation:

Compute local gradients in the neighborhood of the keypoint.

Determine the dominant orientation by finding the peak in the gradient orientation histogram.

“Rotate” the coordinate frame of the descriptor to align with this orientation.

The final descriptor is effectively anchored to that orientation, making it rotation invariant.

16
Q

List the main types of geometric transformations (rigid, affine, non-rigid) and describe a scenario where each is most appropriate.

A

Rigid (translation + rotation; in 3D, rotation about three axes):

Appropriate when the object or anatomy doesn’t deform (e.g., registration of brain images over short times, or registering an object with no shape change).

Affine (includes scaling, shear in addition to rigid):

Useful for images with slight scaling or shear differences (e.g., comparing scans from devices with slightly different pixel spacing or magnification).

Non-rigid / Deformable:

Accounts for local tissue warping or motion (e.g., matching a follow-up MRI with organ shape changes over time, or motion in dynamic imaging of organs that physically move).
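
A minimal sketch (assuming SciPy) of applying a rigid transform, the first class above, to a toy image:

```python
import numpy as np
from scipy.ndimage import affine_transform

img = np.zeros((64, 64))
img[20:40, 20:40] = 1.0                        # toy square

theta = np.deg2rad(5)                          # small rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Note: affine_transform maps output coordinates back to input coordinates (pull mapping).
warped = affine_transform(img, R, offset=(3.0, -2.0), order=1)   # rotation + translation
```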

17
Q

What is mutual information (MI) in image registration, and why is it commonly used for multimodal registration (e.g., MRI to CT)?

A

Mutual Information:

A measure of how much knowing one variable (intensity in MRI) reduces uncertainty about the other (intensity in CT).

Mathematically derived from entropy:
MI(A,B)=H(A)+H(B)−H(A,B).

Usage in Multimodal:

Different modalities yield different intensity distributions for the same tissue. So simpler metrics (like sum of squared differences) don’t always work.

MI is higher when the images are well aligned, because the intensity relationship becomes more predictable.
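
A minimal sketch (assuming NumPy) of estimating MI from a joint intensity histogram of two aligned images:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                     # joint probability
    p_a = p_ab.sum(axis=1, keepdims=True)          # marginal of image A
    p_b = p_ab.sum(axis=0, keepdims=True)          # marginal of image B
    nz = p_ab > 0                                  # avoid log(0)
    return np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz]))
```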

18
Q

How do fiducial markers facilitate image registration, and what is the difference between intrinsic and extrinsic fiducials?

A

Fiducial Markers:

Points or objects visible in imaging that serve as known correspondences between images. They reduce the search space for the transform parameters.

Once you locate them in both images, you can solve for the best transformation aligning the markers.

Intrinsic vs. Extrinsic:

Intrinsic: Naturally occurring landmarks in the anatomy (e.g., corners of the ventricles in the brain).

Extrinsic: Artificial markers placed on or in the patient (e.g., skin markers, bone-implanted markers).

19
Q

Explain Otsu’s thresholding method: how does it partition the image histogram, and what criterion does it optimize?

A

Otsu’s Method:

Evaluates every possible threshold t.

Splits the histogram into two classes: {0,…,t} and {t+1,…,L−1}.

For each t, compute within-class variance (or equivalently between-class variance).

Picks the threshold t* that maximizes the separation (i.e., largest between-class variance or smallest within-class variance).

Essentially, it finds a threshold that best separates background and foreground in a bimodal histogram.

20
Q

Compare k-means segmentation with Gaussian Mixture Model segmentation in terms of assumptions and flexibility.

A

K-means:

Assumes each cluster is roughly spherical in feature space and uses Euclidean distance.

Hard assignments: each pixel belongs entirely to one cluster.

Typically simpler and faster but less flexible if data is not well-separated in spherical clusters.

Gaussian Mixture Model (GMM):

Each cluster is modeled as a Gaussian distribution with mean and covariance (can have elliptical shapes).

Soft assignments: a pixel has a probability of belonging to each cluster.

More computationally expensive (EM algorithm) but can better represent complex data distributions.
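
A minimal sketch (assuming scikit-learn) clustering synthetic intensity values with k-means (hard assignments) vs. a Gaussian mixture (soft assignments):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
intensities = np.concatenate([rng.normal(50, 5, 500),
                              rng.normal(120, 10, 500),
                              rng.normal(200, 8, 500)]).reshape(-1, 1)

kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(intensities)  # one hard label per sample

gmm = GaussianMixture(n_components=3, random_state=0).fit(intensities)
posterior = gmm.predict_proba(intensities)       # probability of each cluster per sample
```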

21
Q

Describe a typical watershed algorithm for segmentation: how is the topographic analogy used, and what are potential pitfalls like over-segmentation?

A

Watershed:

Interpret intensity as elevation.

“Water” floods from low-intensity basins upward. The boundaries (watershed lines) form the segmentation.

Each local minimum forms a catchment basin.

Topographic Analogy:

If you pour water at each local minimum, the water will fill up until it meets water rising from another basin, forming the watershed boundary.

Pitfalls:

Over-segmentation if there are many shallow minima.

Markers or pre-processing (distance transforms, morphological filters) can mitigate that by controlling which minima are relevant.
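
A minimal sketch (assuming scikit-image and SciPy) of marker-controlled watershed on a binary mask, using a distance transform to limit over-segmentation:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def split_touching_objects(mask):
    # mask: boolean foreground mask (assumed), e.g. from Otsu thresholding
    distance = ndi.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=10, labels=mask)   # marker seeds
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=mask)   # flood the inverted distance map
```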

22
Q

What is the motivation for using superpixels, and how does SLIC (Simple Linear Iterative Clustering) generate superpixels that respect image boundaries?

A

Motivation:

Instead of dealing with millions of individual pixels, group them into a few thousand “superpixels” that adhere to object boundaries.

Reduces complexity for subsequent steps (segmentation, classification).

SLIC:

Initializes cluster centers on a regular grid with spacing S = sqrt(N/k), where N is the number of pixels and k is the desired number of superpixels.

Performs iterative local k-means in a 5D space (L,a,b,x,y) for color + coordinates.

Enforces a limited search region (2S×2S around each center), making it computationally efficient.

Final superpixels usually align well with boundaries due to the combined color + position distance measure.
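
A minimal sketch (assuming scikit-image) of generating SLIC superpixels on a sample image:

```python
from skimage.data import astronaut
from skimage.segmentation import slic

img = astronaut()                                   # sample RGB image shipped with scikit-image
labels = slic(img, n_segments=2000, compactness=10, start_label=1)
# 'labels' assigns each pixel a superpixel index; compactness trades colour vs. spatial proximity.
```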

23
Q

Explain why SLIC only needs to do distance computations within a 2S×2S region around the superpixel center, rather than the entire image.

A

Each superpixel is expected to have a roughly square shape of size S×S.

Because of how the cluster centers are laid out, a pixel far outside that region is extremely unlikely to belong to that superpixel.

This local search drastically reduces computations from a naive approach where each pixel is compared to all cluster centers.

24
Q

After computing superpixels, how can we merge or classify them into larger object segments?

A

Merging/Classification Approaches:

Region Merging: Evaluate adjacency of superpixels and merge if they are sufficiently similar in color or texture.

Graph-Based: Treat superpixels as nodes in a graph, edges have a similarity measure—run a higher-level segmentation method like region adjacency or graph cut.

Classifier: Extract features (e.g., mean intensity, texture) from each superpixel and train a supervised classifier to label them (foreground vs. background, or multiple classes).

25
Q

Differentiate between supervised and unsupervised learning. Which category does k-means clustering belong to, and why?

A

Supervised: Labeled data (inputs + known outputs). The model is trained to map inputs to outputs. Examples: logistic regression, SVM, MLP, CNN for classification tasks with known ground truth.

Unsupervised: No labels. The algorithm finds patterns or groups in raw data by itself (e.g., clustering).

k-means: Unsupervised, because it groups data into k clusters without any prior labels. The user just specifies k, not which cluster is correct for each data point.
26
Q

Define overfitting in machine learning. In the context of a medical image classifier, what might be signs that the model is overfitting?

A

Overfitting: The model memorizes training data details (including noise) rather than learning generalizable patterns. It performs extremely well on training data but poorly on new, unseen data.

Signs in a Medical Classifier:

Very high accuracy on the training set, but a significant drop in validation/test accuracy.

The model picks up spurious pixel artifacts or scanner-specific features not related to the underlying pathology.

If adding new, diverse patient data drastically reduces performance, it indicates poor generalization.
27
Q

Explain the hinge loss used by SVMs and how it differs from the cross-entropy loss used by logistic regression.

A

Hinge Loss (for a single sample): L_hinge = max(0, 1 − y(w·x + b)), where y ∈ {−1, +1}.

The model is penalized only if the margin is not respected or the classification is wrong. The SVM tries to find a max-margin solution.

Cross-Entropy Loss (logistic regression): Measures how well predicted probabilities match the true labels.
L_log = −[y ln(ŷ) + (1 − y) ln(1 − ŷ)]

It minimizes the negative log-likelihood, focusing on correct probabilistic output.

Difference: Hinge loss is about margin and linear separability, ignoring small errors beyond the margin. Cross-entropy enforces correct probability estimates for each class.
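
A minimal sketch (assuming NumPy) of both losses for a single sample; the score and probability values are illustrative:

```python
import numpy as np

def hinge_loss(score, y_pm1):              # y_pm1 in {-1, +1}, score = w·x + b
    return max(0.0, 1.0 - y_pm1 * score)

def cross_entropy(p_hat, y01):             # y01 in {0, 1}, p_hat = predicted probability
    return -(y01 * np.log(p_hat) + (1 - y01) * np.log(1 - p_hat))

print(hinge_loss(0.3, +1))                 # margin violated (score < 1) -> positive loss
print(cross_entropy(0.57, 1))              # sigmoid(0.3) ≈ 0.57 -> moderate log-loss
```
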
28
Q

What is a multi-layer perceptron (MLP), and how does it calculate an output starting from an input vector of pixel or feature intensities?

A

MLP: A feed-forward neural network with fully connected layers (linear combination followed by non-linear activation).

Calculation:

The input x is multiplied by weight matrix W^(1) and added to bias b^(1).

A non-linear activation ϕ is applied (e.g., ReLU, sigmoid).

Repeat for each layer. The output layer provides the final class scores or regression values.

Symbolically, z^(l) = W^(l) a^(l−1) + b^(l), a^(l) = ϕ(z^(l)) for each layer l.
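
A minimal sketch (assuming NumPy) of a 2-layer MLP forward pass on a flattened feature vector; the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                        # e.g. a flattened 28x28 patch (hypothetical size)

W1, b1 = rng.normal(size=(128, 784)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(10, 128)) * 0.01, np.zeros(10)

h = np.maximum(0, W1 @ x + b1)                   # ReLU activation: a1 = phi(W1 x + b1)
logits = W2 @ h + b2                             # raw class scores
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over classes
```
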
29
Q

Describe the universal approximation theorem for MLPs. Why does it not guarantee that the model will train effectively in practice?

A

Universal Approximation Theorem: An MLP with at least one hidden layer of sufficient size and a suitable activation function can approximate any continuous function on a compact domain, in principle.

Lack of Training Guarantee:

Having the capability to approximate doesn’t mean the optimization (training) will find those optimal parameters.

Issues such as local minima, saddle points, vanishing/exploding gradients, or insufficient data can prevent reaching the best solution.

Also, high capacity models risk overfitting if not properly regularized.
30
Q

In logistic regression or MLP classification, why is the cross-entropy (log-loss) typically preferred over mean squared error for training?

A

Cross-Entropy Advantages:

Directly derived from maximum likelihood for Bernoulli (binary) or multinomial (multi-class) distributions.

Produces stronger gradient signals when the prediction is wrong, speeding up convergence.

Encourages correct probability estimates.

Mean Squared Error Drawbacks:

Less natural for classification probabilities.

Can lead to slower convergence and suboptimal minima because of different error surface geometry.
31
Q

Explain how convolutional layers differ from fully connected layers, and what ‘weight sharing’ means in the context of CNNs.

A

Convolutional vs. Fully Connected:

Fully Connected: Every input neuron is connected to every output neuron → a large number of parameters.

Convolutional: A small kernel/filter operates locally on part of the input; as it scans across the image it produces a feature map.

Weight Sharing:

The same set of filter weights is applied to every local region in the input.

Greatly reduces the parameter count compared to a fully connected approach, and detects the same feature anywhere in the spatial domain.
32
Q

Describe the effect of stride and padding on the output feature map size in a CNN convolution operation.

A

Stride: The step size with which the kernel moves across the input.

A larger stride → a smaller output dimension (since we skip more pixels).

Stride 1 → maximum coverage; stride 2 → down-samples by roughly half, etc.

Padding: Typically zero-padding around the input border so that the kernel can convolve “beyond” the original edge.

“Valid” convolution (no padding) shrinks the output.

“Same” convolution (pad so that output size = input size) is often used in CNNs to preserve dimensions when stride = 1.
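
A minimal sketch of the standard output-size formula for a square input and kernel (not stated on the card, but widely used): out = floor((in + 2·pad − kernel) / stride) + 1.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(224, 3, stride=1, pad=1))   # 224 -> "same" convolution
print(conv_output_size(224, 3, stride=2, pad=1))   # 112 -> roughly halved
print(conv_output_size(224, 3, stride=1, pad=0))   # 222 -> "valid" convolution shrinks
```
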
33
Q

How does max pooling help to reduce the spatial size of feature maps, and what are potential disadvantages of pooling?

A

Max Pooling:

Takes, for instance, a 2×2 region and outputs the maximum pixel value → reduces dimension by a factor of 2 in both width and height.

Aggregates small neighborhoods, focusing on the strongest activations.

Disadvantages:

Can discard spatial detail (exact positions) in favor of location invariance, which might hinder tasks needing precise localization (e.g., fine boundary segmentation).

Some recent architectures use strided convolutions instead of pooling, or incorporate “unpooling” in decoders to recover spatial detail.
34
Q

Compare patch-wise segmentation vs. a fully convolutional network approach. Why is patch-wise often inefficient?

A

Patch-wise Segmentation:

A classification CNN is run on small patches around each pixel, predicting the label of the central pixel.

Redundant computations because neighboring patches overlap heavily. Very time-consuming.

Fully Convolutional:

Processes the entire image (or a large chunk) in one forward pass. Yields a dense output map for segmentation.

Inefficiency of Patch-Wise:

Repeated convolution operations on overlapping regions → massive computational overhead. Also can cause boundary artifacts for each patch.
35
Q

Outline the basic structure of a U-Net model, emphasizing the role of skip connections between encoder and decoder paths.

A

U-Net:

Encoder: Convolution + pooling layers to down-sample and extract higher-level features. Spatial resolution decreases while depth (feature channels) increases.

Bottleneck: The deepest layer with the smallest spatial dimension.

Decoder: Uses transposed convolutions (or up-convolutions) to up-sample, gradually restoring spatial resolution.

Skip Connections: Each stage in the encoder is fed (copied) to the corresponding decoder stage. Helps the decoder recover fine-grained details lost during down-sampling.

The shape is reminiscent of a “U” in the architecture diagram.
36
Q

What is the main difference between a classification CNN’s output layer and a segmentation CNN’s output layer?

A

Classification CNN: The final layer is typically fully connected, producing a single or small vector of class probabilities for the entire image.

Segmentation CNN (fully convolutional): The final layer is convolutional, generating a probability map (one channel per class) matching the spatial resolution (or near it) of the input.

Each pixel in the final feature map corresponds to a pixel location in the input, indicating the probability of belonging to each class.
37
Q

Explain how the Dice similarity coefficient is computed for segmentation and why it can be more informative than accuracy.

A

Dice Coefficient: Dice(A, B) = 2|A ∩ B| / (|A| + |B|), where A is the set of predicted pixels and B is the ground-truth set.

Why More Informative:

Accuracy can be misleading if the classes are unbalanced (e.g., background is 95% of the image).

Dice focuses on the overlap between predicted and true regions. Even if the background is huge, the metric specifically measures how well the “positive” region is matched.

It’s more sensitive to class mismatch when the target region is small.
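
A minimal sketch (assuming NumPy) of the Dice coefficient for two binary masks:

```python
import numpy as np

def dice(pred, truth, eps=1e-7):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)
```
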
38
Q

Define sensitivity, specificity, and precision in the context of a binary classification for pathology detection.

A

Sensitivity (Recall): TP / (TP + FN). Probability that an actual positive (pathology present) is correctly identified.

Specificity: TN / (TN + FP). Probability that an actual negative (healthy) is correctly identified.

Precision: TP / (TP + FP). Of all predicted positives, how many are truly positive?
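
A minimal sketch (assuming NumPy) computing all three metrics from predicted and true binary labels:

```python
import numpy as np

def binary_metrics(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    sensitivity = tp / (tp + fn)    # assumes at least one actual positive
    specificity = tn / (tn + fp)    # assumes at least one actual negative
    precision = tp / (tp + fp)      # assumes at least one predicted positive
    return sensitivity, specificity, precision
```
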
39
Q

In medical image analysis, how do we handle the issue of class imbalance (e.g., very few pixels are tumor vs. many normal)?

A

Handling Class Imbalance:

Data-level: Over-sampling the minority class (e.g., augmenting tumor patches) or under-sampling the majority class.

Algorithm-level: Adjust the loss function (e.g., weighting cross-entropy to penalize mistakes on the minority class more heavily). Use Dice-based or IoU-based losses specifically for segmentation tasks with small target objects.

Collect more data: If feasible, to ensure the minority class is better represented.
40
Q

Give examples of how you would test the robustness of your segmentation algorithm to noise or changes in image acquisition protocols.

A

Robustness Testing:

Add synthetic noise (Gaussian, Poisson) at different levels to the input images to see if the segmentation degrades gracefully or catastrophically.

Vary acquisition parameters like slice thickness, resolution, or contrast agent dose and re-run the segmentation.

Cross-device testing: Evaluate scans from different manufacturers (e.g., GE vs. Siemens) to check domain generalization.

Phantom tests: Use known phantoms with well-defined geometry to quantitatively measure errors under controlled changes in imaging settings.
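
A minimal sketch (assuming NumPy) of the first test above, a synthetic-noise sweep; segment_fn is a hypothetical segmentation function and dice() is the metric sketched under card 37:

```python
import numpy as np

def robustness_sweep(segment_fn, img, truth, noise_levels=(0.01, 0.05, 0.1)):
    # segment_fn(image) -> binary mask, and dice(pred, truth) are assumed to exist.
    rng = np.random.default_rng(0)
    scores = {}
    for sigma in noise_levels:
        noisy = img + rng.normal(scale=sigma, size=img.shape)
        scores[sigma] = dice(segment_fn(noisy), truth)
    return scores
```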