Medical imaging Flashcards

(40 cards)

1
Q

Which imaging modalities use ionizing radiation and which do not, and why is this distinction clinically important?

A

Ionizing Radiation Modalities:
X-rays (plain radiographs)
CT (Computed Tomography)
Nuclear Medicine scans (e.g., PET, SPECT): these use radioactive tracers that emit gamma rays.

Non-Ionizing Radiation Modalities:
MRI (Magnetic Resonance Imaging): uses magnetic fields and radiofrequency waves.
Ultrasound: uses high-frequency sound waves.

Clinical Importance:
Ionizing radiation can damage biological tissues and DNA if exposure is high or frequent, potentially increasing cancer risk. Therefore, dose minimization is crucial.

Non-ionizing imaging (MRI/Ultrasound) is generally safer regarding radiation concerns but might have other limitations (e.g., cost, availability, scanning time).

2
Q

Explain how CT scanning reconstructs cross-sectional slices from multiple X-ray projections, and why it provides better soft-tissue contrast than a conventional X-ray.

A

Reconstruction Principle:
- In CT, the X-ray source and detector rotate 360° around the patient. Multiple 1D projections are acquired at many angles.

  • A reconstruction algorithm (often Filtered Back-Projection or modern iterative methods) mathematically reconstructs a 2D slice from these projections.

Better Soft-Tissue Contrast:
- Conventional X-rays compress all tissues into a single 2D projection, so overlapping structures can hide subtle differences.
- CT provides cross-sectional slices, reducing overlap. Tiny differences in attenuation (density) become more apparent in the slice, yielding improved soft-tissue discrimination.

3
Q

What is the role of radioisotopes in PET (Positron Emission Tomography), and how does it fundamentally differ from MRI in terms of the information it provides?

A

Role of Radioisotopes in PET:

  • A tracer (e.g., FDG, a glucose analog labeled with ^18F) is injected into the patient. Cancer cells or metabolically active tissues take up more tracer.
  • When the isotope decays, it emits positrons which annihilate with electrons, producing gamma photons detected by the PET scanner.

Difference from MRI:

  • PET: Provides functional or metabolic information (e.g., how actively a region uses glucose).
  • MRI: Provides anatomical and some functional details (e.g., T1/T2-weighted contrasts or fMRI for blood oxygen level) but not direct metabolic uptake images. MRI relies on magnetic properties of hydrogen nuclei in tissues.
4
Q

Define bit depth in digital images and discuss how differences in bit depth (e.g., 8-bit vs. 16-bit) impact possible intensity values.

A

Bit Depth: The number of bits used to represent each pixel’s intensity.
- 8-bit: 2^8 = 256 discrete intensity levels.
- 16-bit: 2^16 = 65,536 levels.

Impact:
A higher bit depth allows a wider range of intensities and finer gradations (useful in modalities like CT or some microscopy).
8-bit images may risk losing subtle intensity differences, whereas 16-bit retains small gradations but often requires special viewing/processing for display.
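
A minimal sketch (assuming NumPy) of how bit depth maps to the representable intensity range, and how a naive 16-bit → 8-bit conversion collapses gradations:

```python
import numpy as np

print(np.iinfo(np.uint8).max)    # 255    -> 2**8 levels (0..255)
print(np.iinfo(np.uint16).max)   # 65535  -> 2**16 levels (0..65535)

# Naive down-conversion: dividing by 257 maps 0..65535 onto 0..255,
# so many distinct 16-bit values land on the same 8-bit value.
img16 = np.linspace(0, 65535, 1000).astype(np.uint16)
img8 = (img16 / 257).astype(np.uint8)
print(np.unique(img16).size, np.unique(img8).size)   # e.g. 1000 vs. 256
```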

5
Q

What is DICOM, and why is it crucial for the interoperability of medical imaging devices?

A

DICOM (Digital Imaging and Communications in Medicine):
- A standardized format and network protocol for storing/transmitting medical images.
- Encodes not only pixel data but also metadata (patient ID, study date, modality parameters, etc.).

Importance:
Ensures interoperability across different scanners, PACS (Picture Archiving and Communication Systems), and software.
Allows hospitals and clinics to exchange imaging studies reliably, enabling consistent workflows worldwide.
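
A minimal sketch of reading a DICOM file, assuming the pydicom library and a hypothetical file path:

```python
import pydicom

ds = pydicom.dcmread("study/slice_001.dcm")      # hypothetical path
print(ds.PatientID, ds.Modality, ds.StudyDate)   # metadata travels with the pixels
pixels = ds.pixel_array                          # NumPy array of the stored pixel data
print(pixels.shape, pixels.dtype)
```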

6
Q

Explain the Nyquist–Shannon sampling theorem in simple terms, and how violating it causes aliasing artifacts in images.

A

Sampling Theorem: A signal must be sampled at a rate at least twice its highest frequency component (the Nyquist rate) to capture all information without losing detail.

Violation → Aliasing:
If sampling is too sparse, high-frequency details appear as misleading lower-frequency patterns (e.g., moiré).
In images, small repetitive structures or sharp edges can be incorrectly rendered, creating artifacts or patterns that weren’t actually there.
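
A minimal sketch (assuming NumPy) of aliasing in 1D: a 9 Hz sine sampled at 10 samples/s (below the 18 samples/s Nyquist rate) is indistinguishable from its 1 Hz alias.

```python
import numpy as np

t = np.arange(0, 1, 0.1)                    # 10 samples per second
sampled_9hz = np.sin(2 * np.pi * 9 * t)
alias_1hz = np.sin(2 * np.pi * -1 * t)      # the -1 Hz (i.e. 1 Hz) alias
print(np.allclose(sampled_9hz, alias_1hz))  # True: the samples cannot tell them apart
```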

7
Q

What is a histogram in image processing, and how do you use it for simple thresholding of an image (e.g., Otsu’s method)?

A

Histogram: A distribution of the pixel intensities. For grayscale images, the x-axis is intensity and y-axis is the count (or probability) of pixels at each intensity.

Thresholding:

E.g., Otsu’s Method: Automatically finds an intensity threshold that separates foreground and background by maximizing between-class variance (or minimizing within-class variance).

Implementation typically calculates for each candidate threshold how well it separates the histogram into two clusters, and picks the best threshold.
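
A minimal sketch (assuming a NumPy uint8 grayscale image) of that candidate-threshold search, maximizing between-class variance:

```python
import numpy as np

def otsu_threshold(img):
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()                      # intensity probabilities
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()      # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Usage: mask = img >= otsu_threshold(img)
```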

8
Q

How does linear contrast stretching work, and why do we sometimes apply it to 16-bit images before visualization?

A

Linear Contrast Stretching:

Maps an input range [I_min, I_max] to a new dynamic range [0, 255] (if 8-bit output).

Formula: I_out = α (I_in − I_min), where
α = 255 / (I_max − I_min)

Reason for 16-bit → 8-bit:

16-bit images hold many intensity levels, but typical monitors display only 8-bit. A linear stretch ensures the visible range (0–255) best represents the relevant intensities.

This reveals subtle differences otherwise hidden in the extended range of the 16-bit image.
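
A minimal sketch (assuming NumPy) of the stretch described above, taking a 16-bit image to 0–255 for display:

```python
import numpy as np

def stretch_to_8bit(img16):
    i_min, i_max = float(img16.min()), float(img16.max())
    alpha = 255.0 / (i_max - i_min)               # scale factor from the formula above
    out = alpha * (img16.astype(np.float64) - i_min)
    return np.clip(out, 0, 255).astype(np.uint8)
```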

9
Q

Outline the basic formula for 2D convolution and explain how kernel flipping differs between convolution and cross-correlation.

A

2D Convolution:

g(x, y) = (f ∗ h)(x, y) = Σ_{m=−k}^{k} Σ_{n=−k}^{k} f(x − m, y − n) h(m, n)

Kernel Flipping:
True mathematical convolution flips the kernel in both x and y directions (i.e., h(−m,−n)).

Cross-correlation omits the flip, effectively using h(m,n) as is.

Many image processing libraries do cross-correlation by default because it’s simpler for template matching.
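
A minimal sketch (assuming SciPy) showing that convolution equals cross-correlation with a kernel flipped in both axes:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

f = np.arange(25, dtype=float).reshape(5, 5)                 # toy "image"
h = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])     # asymmetric toy kernel

conv = convolve2d(f, h, mode="same")
corr_with_flipped = correlate2d(f, h[::-1, ::-1], mode="same")
print(np.allclose(conv, corr_with_flipped))                  # True
```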

10
Q

Compare the box (mean) filter with a Gaussian filter in terms of smoothing performance and side effects on edges.

A

Box (Mean) Filter:

Simple average of the neighborhood.

Strongly blurs edges (noisy corners are significantly smoothed).

Can introduce block-like artifacts because each pixel is equally weighted in the local window.

Gaussian Filter:

Weights pixels according to a Gaussian distribution (center > edges of the window).

More natural smoothing, less likely to create abrupt artifacts.

Preserves edges slightly better than box filter because it places more emphasis on central pixels.

Also used in many scale-space techniques (LoG, DoG, etc.).
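
A minimal sketch (assuming SciPy) smoothing a noisy step edge with both filters:

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = 1.0                                  # vertical step edge
img += rng.normal(scale=0.2, size=img.shape)       # additive noise

box_smoothed = uniform_filter(img, size=5)         # every pixel in the 5x5 window weighted equally
gauss_smoothed = gaussian_filter(img, sigma=1.5)   # weights fall off away from the window centre
```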

11
Q

What is the Sobel operator, and how do its horizontal and vertical kernels detect edge directions?

A

Sobel Operator:

A pair of 3×3 filters used to approximate the intensity gradient in horizontal (x) and vertical (y) directions.

Typically:

Gx = [−1 0 1; −2 0 2; −1 0 1]
Gy = [−1 −2 −1; 0 0 0; 1 2 1]

Edge Direction Detection:

Convolving with Gx captures changes along x (vertical edges).

Convolving with Gy captures changes along y (horizontal edges).

The gradient magnitude sqrt(Gx² + Gy²) indicates edge strength, and arctan(Gy/Gx) indicates direction.
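
A minimal sketch (assuming SciPy) computing Sobel gradients, magnitude, and direction:

```python
import numpy as np
from scipy.ndimage import sobel

def sobel_edges(img):
    # img: 2D float array (assumed)
    gx = sobel(img, axis=1)           # derivative along x -> responds to vertical edges
    gy = sobel(img, axis=0)           # derivative along y -> responds to horizontal edges
    magnitude = np.hypot(gx, gy)      # edge strength
    direction = np.arctan2(gy, gx)    # edge orientation in radians
    return magnitude, direction
```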

12
Q

Describe how the Laplacian-of-Gaussian filter (LoG) is used for blob detection and why scale selection is crucial.

A

LoG Filter:

∇²(G_σ ∗ I): a second-derivative-based operator applied to a Gaussian-smoothed image.

Detects regions where intensity changes from bright to dark or vice versa in a “blob”-like manner.

Blob Detection:

A circular (or elliptical) region can be identified if LoG response is high and changes sign near the center.

By examining multiple σ values, one can detect blobs of different sizes (multiscale analysis).

Scale Selection:

Real-world objects come in different sizes. A single σ might only detect certain sized blobs.

Using a range of scales ensures capturing small, medium, or large structures.
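
A minimal sketch of multiscale LoG blob detection, assuming scikit-image and a grayscale float image:

```python
import numpy as np
from skimage.feature import blob_log

def detect_blobs(img):
    # img: 2D grayscale array, roughly in [0, 1] (assumed)
    # Each returned row is (y, x, sigma); blob radius is approximately sigma * sqrt(2).
    blobs = blob_log(img, min_sigma=2, max_sigma=15, num_sigma=10, threshold=0.1)
    blobs[:, 2] *= np.sqrt(2)
    return blobs
```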

13
Q

Briefly describe the four main steps of SIFT (Scale-Space Extrema, Keypoint Localization, Orientation Assignment, Descriptor Formation).

A

Scale-Space Extrema:

Build a Gaussian pyramid and compute Difference of Gaussians (DoG) at multiple scales.

Identify local maxima/minima in a 3D neighborhood (x, y, scale).

Keypoint Localization:

Refine each candidate’s location and scale.

Reject low-contrast points or points on edges (via a Hessian-based check).

Orientation Assignment:

Compute local gradient orientations in a region around the keypoint.

Assign a dominant orientation (or multiple if they are within 80% of the peak).

Descriptor Formation:

For a 16×16 window around the keypoint (rotated to the assigned orientation), form 4×4 subregions.

Each subregion accumulates an 8-bin gradient histogram → 4×4×8 = 128D descriptor vector.
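
A minimal sketch of running SIFT end to end, assuming OpenCV 4.4+ (where SIFT_create is available) and a hypothetical file name; the four steps above all happen inside detectAndCompute:

```python
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries (x, y), scale, and orientation; each descriptor is a 128-D vector.
```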

14
Q

Why is Difference of Gaussians (DoG) used as an approximation to the Laplacian in SIFT, and how does it help identify scale-invariant keypoints?

A

DoG Approximation:

The Laplacian-of-Gaussian can be computed by subtracting two Gaussians at nearby scales (Gauss(σ) - Gauss(κσ)).

This is computationally more efficient and stable than directly convolving with LoG.

Scale Invariance:

By building a pyramid of images blurred at different scales, local maxima in DoG indicate potential blobs (keypoints).

The scale at which the response is maximal corresponds to the intrinsic scale of the feature, making detection invariant to image size changes.

15
Q

In the context of SIFT, what do we mean by rotation invariance, and how is it achieved in practice?

A

Rotation Invariance:

Keypoints should be matched even if the object appears rotated in another image.

Implementation:

Compute local gradients in the neighborhood of the keypoint.

Determine the dominant orientation by finding the peak in the gradient orientation histogram.

“Rotate” the coordinate frame of the descriptor to align with this orientation.

The final descriptor is effectively anchored to that orientation, making it rotation invariant.

16
Q

List the main types of geometric transformations (rigid, affine, non-rigid) and describe a scenario where each is most appropriate.

A

Rigid (translation + rotation; in 3D, rotation about three axes):

Appropriate when the object or anatomy doesn’t deform (e.g., registration of brain images over short times, or registering an object with no shape change).

Affine (includes scaling, shear in addition to rigid):

Useful for images with slight scaling or shear differences (e.g., comparing scans from devices with slightly different pixel spacing or magnification).

Non-rigid / Deformable:

Accounts for local tissue warping or motion (e.g., matching a follow-up MRI with organ shape changes over time, or motion in dynamic imaging of organs that physically move).
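
A minimal sketch (assuming SciPy) of applying a rigid transform, the first class above, to a toy image:

```python
import numpy as np
from scipy.ndimage import affine_transform

img = np.zeros((64, 64))
img[20:40, 20:40] = 1.0                        # toy square

theta = np.deg2rad(5)                          # small rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Note: affine_transform maps output coordinates back to input coordinates (pull mapping).
warped = affine_transform(img, R, offset=(3.0, -2.0), order=1)   # rotation + translation
```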

17
Q

What is mutual information (MI) in image registration, and why is it commonly used for multimodal registration (e.g., MRI to CT)?

A

Mutual Information:

A measure of how much knowing one variable (intensity in MRI) reduces uncertainty about the other (intensity in CT).

Mathematically derived from entropy:
MI(A,B)=H(A)+H(B)−H(A,B).

Usage in Multimodal:

Different modalities yield different intensity distributions for the same tissue. So simpler metrics (like sum of squared differences) don’t always work.

MI is higher when the images are well aligned, because the intensity relationship becomes more predictable.
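
A minimal sketch (assuming NumPy) of estimating MI from a joint intensity histogram of two aligned images:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                     # joint probability
    p_a = p_ab.sum(axis=1, keepdims=True)          # marginal of image A
    p_b = p_ab.sum(axis=0, keepdims=True)          # marginal of image B
    nz = p_ab > 0                                  # avoid log(0)
    return np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz]))
```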

18
Q

How do fiducial markers facilitate image registration, and what is the difference between intrinsic and extrinsic fiducials?

A

Fiducial Markers:

Points or objects visible in imaging that serve as known correspondences between images. They reduce the search space for the transform parameters.

Once you locate them in both images, you can solve for the best transformation aligning the markers.

Intrinsic vs. Extrinsic:

Intrinsic: Naturally occurring landmarks in the anatomy (e.g., corners of the ventricles in the brain).

Extrinsic: Artificial markers placed on or in the patient (e.g., skin markers, bone-implanted markers).

19
Q

Explain Otsu’s thresholding method: how does it partition the image histogram, and what criterion does it optimize?

A

Otsu’s Method:

Evaluates every possible threshold t.

Splits the histogram into two classes: {0,…,t} and {t+1,…,L−1}.

For each t, compute within-class variance (or equivalently between-class variance).

Picks the threshold t* that maximizes the separation (i.e., largest between-class variance or smallest within-class variance).

Essentially, it finds a threshold that best separates background and foreground in a bimodal histogram.

20
Q

Compare k-means segmentation with Gaussian Mixture Model segmentation in terms of assumptions and flexibility.

A

K-means:

Assumes each cluster is roughly spherical in feature space and uses Euclidean distance.

Hard assignments: each pixel belongs entirely to one cluster.

Typically simpler and faster but less flexible if data is not well-separated in spherical clusters.

Gaussian Mixture Model (GMM):

Each cluster is modeled as a Gaussian distribution with mean and covariance (can have elliptical shapes).

Soft assignments: a pixel has a probability of belonging to each cluster.

More computationally expensive (EM algorithm) but can better represent complex data distributions.
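
A minimal sketch (assuming scikit-learn) clustering synthetic intensity values with k-means (hard assignments) vs. a Gaussian mixture (soft assignments):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
intensities = np.concatenate([rng.normal(50, 5, 500),
                              rng.normal(120, 10, 500),
                              rng.normal(200, 8, 500)]).reshape(-1, 1)

kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(intensities)  # one hard label per sample

gmm = GaussianMixture(n_components=3, random_state=0).fit(intensities)
posterior = gmm.predict_proba(intensities)       # probability of each cluster per sample
```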

21
Q

Describe a typical watershed algorithm for segmentation: how is the topographic analogy used, and what are potential pitfalls like over-segmentation?

A

Watershed:

Interpret intensity as elevation.

“Water” floods from low-intensity basins upward. The boundaries (watershed lines) form the segmentation.

Each local minimum forms a catchment basin.

Topographic Analogy:

If you pour water at each local minimum, the water will fill up until it meets water rising from another basin, forming the watershed boundary.

Pitfalls:

Over-segmentation if there are many shallow minima.

Markers or pre-processing (distance transforms, morphological filters) can mitigate that by controlling which minima are relevant.
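
A minimal sketch (assuming scikit-image and SciPy) of marker-controlled watershed on a binary mask, using a distance transform to limit over-segmentation:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed
from skimage.feature import peak_local_max

def split_touching_objects(mask):
    # mask: boolean foreground mask (assumed), e.g. from Otsu thresholding
    distance = ndi.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=10, labels=mask)   # marker seeds
    markers = np.zeros(mask.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=mask)   # flood the inverted distance map
```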

22
Q

What is the motivation for using superpixels, and how does SLIC (Simple Linear Iterative Clustering) generate superpixels that respect image boundaries?

A

Motivation:

Instead of dealing with millions of individual pixels, group them into a few thousand “superpixels” that adhere to object boundaries.

Reduces complexity for subsequent steps (segmentation, classification).

SLIC:

Initializes cluster centers on a regular grid with spacing S = sqrt(N/k), where N is the number of pixels and k is the desired number of superpixels.

Performs iterative local k-means in a 5D space (L,a,b,x,y) for color + coordinates.

Enforces a limited search region (2S×2S around each center), making it computationally efficient.

Final superpixels usually align well with boundaries due to the combined color + position distance measure.
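
A minimal sketch (assuming scikit-image) of generating SLIC superpixels on a sample image:

```python
from skimage.data import astronaut
from skimage.segmentation import slic

img = astronaut()                                   # sample RGB image shipped with scikit-image
labels = slic(img, n_segments=2000, compactness=10, start_label=1)
# 'labels' assigns each pixel a superpixel index; compactness trades colour vs. spatial proximity.
```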

23
Q

Explain why SLIC only needs to do distance computations within a 2S×2S region around the superpixel center, rather than the entire image.

A

Each superpixel is expected to have a roughly square shape of size S×S.

Because of how the cluster centers are laid out, a pixel far outside that region is extremely unlikely to belong to that superpixel.

This local search drastically reduces computations from a naive approach where each pixel is compared to all cluster centers.

24
Q

After computing superpixels, how can we merge or classify them into larger object segments?

A

Merging/Classification Approaches:

Region Merging: Evaluate adjacency of superpixels and merge if they are sufficiently similar in color or texture.

Graph-Based: Treat superpixels as nodes in a graph, edges have a similarity measure—run a higher-level segmentation method like region adjacency or graph cut.

Classifier: Extract features (e.g., mean intensity, texture) from each superpixel and train a supervised classifier to label them (foreground vs. background, or multiple classes).

25
Q

Differentiate between supervised and unsupervised learning. Which category does k-means clustering belong to, and why?

A

Supervised: Labeled data (inputs + known outputs). The model is trained to map inputs to outputs. Examples: logistic regression, SVM, MLP, CNN for classification tasks with known ground truth.

Unsupervised: No labels. The algorithm finds patterns or groups in raw data by itself (e.g., clustering).

k-means: Unsupervised, because it groups data into k clusters without any prior labels. The user just specifies k, not which cluster is correct for each data point.
26
Q

Define overfitting in machine learning. In the context of a medical image classifier, what might be signs that the model is overfitting?

A

Overfitting: The model memorizes training data details (including noise) rather than learning generalizable patterns. It performs extremely well on training data but poorly on new, unseen data.

Signs in a Medical Classifier:

Very high accuracy on the training set, but a significant drop in validation/test accuracy.

The model picks up spurious pixel artifacts or scanner-specific features not related to the underlying pathology.

If adding new, diverse patient data drastically reduces performance, it indicates poor generalization.
27
Q

Explain the hinge loss used by SVMs and how it differs from the cross-entropy loss used by logistic regression.

A

Hinge Loss (for a single sample): L_hinge = max(0, 1 − y(w·x + b)), where y ∈ {−1, +1}.

The model is penalized only if the margin is not respected or the classification is wrong. The SVM tries to find a max-margin solution.

Cross-Entropy Loss (logistic regression): Measures how well predicted probabilities match the true labels.
L_log = −[y ln(ŷ) + (1 − y) ln(1 − ŷ)]

It minimizes the negative log-likelihood, focusing on correct probabilistic output.

Difference: Hinge loss is about margin and linear separability, ignoring small errors beyond the margin. Cross-entropy enforces correct probability estimates for each class.
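
A minimal sketch (assuming NumPy) of both losses for a single sample; the score and probability values are illustrative:

```python
import numpy as np

def hinge_loss(score, y_pm1):              # y_pm1 in {-1, +1}, score = w·x + b
    return max(0.0, 1.0 - y_pm1 * score)

def cross_entropy(p_hat, y01):             # y01 in {0, 1}, p_hat = predicted probability
    return -(y01 * np.log(p_hat) + (1 - y01) * np.log(1 - p_hat))

print(hinge_loss(0.3, +1))                 # margin violated (score < 1) -> positive loss
print(cross_entropy(0.57, 1))              # sigmoid(0.3) ≈ 0.57 -> moderate log-loss
```
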
28
Q

What is a multi-layer perceptron (MLP), and how does it calculate an output starting from an input vector of pixel or feature intensities?

A

MLP: A feed-forward neural network with fully connected layers (linear combination followed by non-linear activation).

Calculation:

The input x is multiplied by weight matrix W^(1) and added to bias b^(1).

A non-linear activation ϕ is applied (e.g., ReLU, sigmoid).

Repeat for each layer. The output layer provides the final class scores or regression values.

Symbolically, z^(l) = W^(l) a^(l−1) + b^(l), a^(l) = ϕ(z^(l)) for each layer l.
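
A minimal sketch (assuming NumPy) of a 2-layer MLP forward pass on a flattened feature vector; the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                        # e.g. a flattened 28x28 patch (hypothetical size)

W1, b1 = rng.normal(size=(128, 784)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(10, 128)) * 0.01, np.zeros(10)

h = np.maximum(0, W1 @ x + b1)                   # ReLU activation: a1 = phi(W1 x + b1)
logits = W2 @ h + b2                             # raw class scores
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over classes
```
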
29
Q

Describe the universal approximation theorem for MLPs. Why does it not guarantee that the model will train effectively in practice?

A

Universal Approximation Theorem: An MLP with at least one hidden layer of sufficient size and a suitable activation function can approximate any continuous function on a compact domain, in principle.

Lack of Training Guarantee:

Having the capability to approximate doesn’t mean the optimization (training) will find those optimal parameters.

Issues such as local minima, saddle points, vanishing/exploding gradients, or insufficient data can prevent reaching the best solution.

Also, high capacity models risk overfitting if not properly regularized.
30
Q

In logistic regression or MLP classification, why is the cross-entropy (log-loss) typically preferred over mean squared error for training?

A

Cross-Entropy Advantages:

Directly derived from maximum likelihood for Bernoulli (binary) or multinomial (multi-class) distributions.

Produces stronger gradient signals when the prediction is wrong, speeding up convergence.

Encourages correct probability estimates.

Mean Squared Error Drawbacks:

Less natural for classification probabilities.

Can lead to slower convergence and suboptimal minima because of different error surface geometry.
31
Q

Explain how convolutional layers differ from fully connected layers, and what ‘weight sharing’ means in the context of CNNs.

A

Convolutional vs. Fully Connected:

Fully Connected: Every input neuron is connected to every output neuron → a large number of parameters.

Convolutional: A small kernel/filter operates locally on part of the input; as it scans across the image it produces a feature map.

Weight Sharing:

The same set of filter weights is applied to every local region in the input.

Greatly reduces the parameter count compared to a fully connected approach, and detects the same feature anywhere in the spatial domain.
32
Q

Describe the effect of stride and padding on the output feature map size in a CNN convolution operation.

A

Stride: The step size with which the kernel moves across the input.

A larger stride → a smaller output dimension (since we skip more pixels).

Stride 1 → maximum coverage; stride 2 → down-samples by roughly half, etc.

Padding: Typically zero-padding around the input border so that the kernel can convolve “beyond” the original edge.

“Valid” convolution (no padding) shrinks the output.

“Same” convolution (pad so that output size = input size) is often used in CNNs to preserve dimensions when stride = 1.
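
A minimal sketch of the standard output-size formula for a square input and kernel (not stated on the card, but widely used): out = floor((in + 2·pad − kernel) / stride) + 1.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    return (in_size + 2 * pad - kernel) // stride + 1

print(conv_output_size(224, 3, stride=1, pad=1))   # 224 -> "same" convolution
print(conv_output_size(224, 3, stride=2, pad=1))   # 112 -> roughly halved
print(conv_output_size(224, 3, stride=1, pad=0))   # 222 -> "valid" convolution shrinks
```
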
33
Q

How does max pooling help to reduce the spatial size of feature maps, and what are potential disadvantages of pooling?

A

Max Pooling:

Takes, for instance, a 2×2 region and outputs the maximum pixel value → reduces dimension by a factor of 2 in both width and height.

Aggregates small neighborhoods, focusing on the strongest activations.

Disadvantages:

Can discard spatial detail (exact positions) in favor of location invariance, which might hinder tasks needing precise localization (e.g., fine boundary segmentation).

Some recent architectures use strided convolutions instead of pooling, or incorporate “unpooling” in decoders to recover spatial detail.
34
Q

Compare patch-wise segmentation vs. a fully convolutional network approach. Why is patch-wise often inefficient?

A

Patch-wise Segmentation:

A classification CNN is run on small patches around each pixel, predicting the label of the central pixel.

Redundant computations because neighboring patches overlap heavily. Very time-consuming.

Fully Convolutional:

Processes the entire image (or a large chunk) in one forward pass. Yields a dense output map for segmentation.

Inefficiency of Patch-Wise:

Repeated convolution operations on overlapping regions → massive computational overhead. Also can cause boundary artifacts for each patch.
35
Q

Outline the basic structure of a U-Net model, emphasizing the role of skip connections between encoder and decoder paths.

A

U-Net:

Encoder: Convolution + pooling layers to down-sample and extract higher-level features. Spatial resolution decreases while depth (feature channels) increases.

Bottleneck: The deepest layer with the smallest spatial dimension.

Decoder: Uses transposed convolutions (or up-convolutions) to up-sample, gradually restoring spatial resolution.

Skip Connections: Each stage in the encoder is fed (copied) to the corresponding decoder stage. Helps the decoder recover fine-grained details lost during down-sampling.

The shape is reminiscent of a “U” in the architecture diagram.
36
Q

What is the main difference between a classification CNN’s output layer and a segmentation CNN’s output layer?

A

Classification CNN: The final layer is typically fully connected, producing a single or small vector of class probabilities for the entire image.

Segmentation CNN (fully convolutional): The final layer is convolutional, generating a probability map (one channel per class) matching the spatial resolution (or near it) of the input.

Each pixel in the final feature map corresponds to a pixel location in the input, indicating the probability of belonging to each class.
37
Q

Explain how the Dice similarity coefficient is computed for segmentation and why it can be more informative than accuracy.

A

Dice Coefficient: Dice(A, B) = 2|A ∩ B| / (|A| + |B|), where A is the set of predicted pixels and B is the ground-truth set.

Why More Informative:

Accuracy can be misleading if the classes are unbalanced (e.g., background is 95% of the image).

Dice focuses on the overlap between predicted and true regions. Even if the background is huge, the metric specifically measures how well the “positive” region is matched.

It’s more sensitive to class mismatch when the target region is small.
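
A minimal sketch (assuming NumPy) of the Dice coefficient for two binary masks:

```python
import numpy as np

def dice(pred, truth, eps=1e-7):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)
```
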
38
Q

Define sensitivity, specificity, and precision in the context of a binary classification for pathology detection.

A

Sensitivity (Recall): TP / (TP + FN). Probability that an actual positive (pathology present) is correctly identified.

Specificity: TN / (TN + FP). Probability that an actual negative (healthy) is correctly identified.

Precision: TP / (TP + FP). Of all predicted positives, how many are truly positive?
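
A minimal sketch (assuming NumPy) computing all three metrics from predicted and true binary labels:

```python
import numpy as np

def binary_metrics(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    sensitivity = tp / (tp + fn)    # assumes at least one actual positive
    specificity = tn / (tn + fp)    # assumes at least one actual negative
    precision = tp / (tp + fp)      # assumes at least one predicted positive
    return sensitivity, specificity, precision
```
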
39
Q

In medical image analysis, how do we handle the issue of class imbalance (e.g., very few pixels are tumor vs. many normal)?

A

Handling Class Imbalance:

Data-level: Over-sampling the minority class (e.g., augmenting tumor patches) or under-sampling the majority class.

Algorithm-level: Adjust the loss function (e.g., weighting cross-entropy to penalize mistakes on the minority class more heavily). Use Dice-based or IoU-based losses specifically for segmentation tasks with small target objects.

Collect more data: If feasible, to ensure the minority class is better represented.
40
Q

Give examples of how you would test the robustness of your segmentation algorithm to noise or changes in image acquisition protocols.

A

Robustness Testing:

Add synthetic noise (Gaussian, Poisson) at different levels to the input images to see if the segmentation degrades gracefully or catastrophically.

Vary acquisition parameters like slice thickness, resolution, or contrast agent dose and re-run the segmentation.

Cross-device testing: Evaluate scans from different manufacturers (e.g., GE vs. Siemens) to check domain generalization.

Phantom tests: Use known phantoms with well-defined geometry to quantitatively measure errors under controlled changes in imaging settings.
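
A minimal sketch (assuming NumPy) of the first test above, a synthetic-noise sweep; segment_fn is a hypothetical segmentation function and dice() is the metric sketched under card 37:

```python
import numpy as np

def robustness_sweep(segment_fn, img, truth, noise_levels=(0.01, 0.05, 0.1)):
    # segment_fn(image) -> binary mask, and dice(pred, truth) are assumed to exist.
    rng = np.random.default_rng(0)
    scores = {}
    for sigma in noise_levels:
        noisy = img + rng.normal(scale=sigma, size=img.shape)
        scores[sigma] = dice(segment_fn(noisy), truth)
    return scores
```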