6 - Medical data Flashcards
(20 cards)
What are the five main types of medical data?
Tabular, unstructured text, signal/time-series, genomic/omics, medical images.
What are common challenges in medical data?
Privacy, regulation, missing data, noisy labels, biases, domain shifts.
What are examples of tabular healthcare data?
Age, sex, lab test values, diagnoses, medications, procedures.
What are the three types of variables in tabular data?
Categorical, ordinal, continuous.
What is data/label leakage?
Training inputs contain future or target information that would not be available at prediction time, violating causality.
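A common way leakage sneaks in is preprocessing: statistics computed on the full dataset let test patients influence the training inputs. The sketch below is a minimal illustration with made-up lab values; the helper names `fit_scaler` and `transform` are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

# Made-up lab values, ordered by admission time.
train = [4.0, 5.0, 6.0, 5.0]
test = [9.0, 10.0]  # later patients, unseen at training time

# Leaky: the scaler sees the test patients.
mu_leaky, sd_leaky = fit_scaler(train + test)

# Correct: fit on training data only, then apply to the test data.
mu, sd = fit_scaler(train)
test_scaled = transform(test, mu, sd)
```

The two fitted means differ, which is the leak: in the leaky version the training inputs already encode where the later patients sit.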
What is the Clever-Hans effect?
The model relies on spurious correlations in the training data instead of causal features.
What is domain shift?
A change in the joint or marginal distribution of the data between training and test time.
● Prior shift (p(y)): Change in the probability of observing a certain phenotype (e.g., during pandemic waves).
● Covariate shift (p(x)): Change in the distribution of patient features (e.g., fibrinogen levels based on time of year).
● General domain shift (p(y,x)): Change in the joint distribution.
● Concept shift (p(y∣x)): Change in disease definition, diagnostic method, or wet lab procedures (e.g., antigen test replaced by PCR test).
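One way to see how these cases relate is the product rule for the joint distribution (a standard identity, written here in the notation of the bullets above):

```latex
p(x, y) = p(y \mid x)\, p(x) = p(x \mid y)\, p(y)
```

Covariate shift changes the factor p(x), concept shift changes p(y∣x), and prior shift changes p(y); any of them changes the joint p(y,x). Covariate and prior shift are usually defined with the matching conditional, p(y∣x) or p(x∣y), held fixed.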
What is missing-not-at-random (MNAR)?
Missingness depends on the unobserved value itself or on other unmeasured factors, e.g., a patient is too sick to be tested.
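A small simulation can show why MNAR matters. This is a sketch with synthetic severity scores and an arbitrary cutoff of 60 (both are assumptions for illustration): the sickest patients are never measured, so the observed mean underestimates the true mean.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic "severity" scores for 10,000 hypothetical patients.
true_values = [random.gauss(50, 10) for _ in range(10_000)]

def observed_under_mnar(values, threshold=60.0):
    """Keep only values at or below the threshold.

    Missingness depends on the value itself: patients above the
    threshold are "too sick to test" and never get a measurement.
    """
    return [v for v in values if v <= threshold]

observed = observed_under_mnar(true_values)

# Negative bias: the observed data look healthier than the population.
bias = mean(observed) - mean(true_values)
```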
What are methods of data imputation?
Mean or median imputation, GAIN (Generative Adversarial Imputation Nets), MICE (Multiple Imputation by Chained Equations).
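The two simplest methods can be sketched in a few lines. The function name `impute` and the creatinine values are made up for illustration; GAIN and MICE are model-based and are not shown here.

```python
from statistics import mean, median

def impute(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]

# Made-up creatinine values with two missing measurements.
creatinine = [0.9, None, 1.1, 4.0, None]
mean_filled = impute(creatinine, "mean")
median_filled = impute(creatinine, "median")
```

With the outlier 4.0 in the column, the median fill stays close to the typical values while the mean fill is pulled upward, which is why the median is often preferred for skewed lab values.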
What is federated learning?
Training a model across multiple sites without sharing patient data.
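A minimal sketch of the aggregation step, assuming the federated averaging (FedAvg) scheme, which is one common choice and is not named in the card itself. Each site sends only its locally trained weights and its sample count; `federated_average` and the numbers are hypothetical.

```python
def federated_average(site_updates):
    """Combine local models into a global one.

    site_updates: list of (weights, n_samples) tuples, one per site.
    Returns the sample-weighted mean of the weight vectors.
    """
    total = sum(n for _, n in site_updates)
    dim = len(site_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in site_updates) / total
        for i in range(dim)
    ]

# Two hypothetical hospitals: only weights and counts leave each site,
# never the patient records.
updates = [
    ([1.0, 0.0], 100),  # hospital A, 100 patients
    ([3.0, 2.0], 300),  # hospital B, 300 patients
]
global_weights = federated_average(updates)
```

In a full training loop the server would send `global_weights` back to the sites for another round of local training.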
Name three ML methods used in medical data.
Random forest, gradient boosting, support vector machines.
What are TabNet and VIME?
Deep learning methods designed for tabular data; both make use of self-supervised learning.
What are common interpretability methods?
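ICE plots: local and model-agnostic; show how a single prediction changes as you vary one feature.
SHAP, LIME: local methods that attribute an individual prediction to its input features; often preferred by clinicians.
Gini importance: built into decision tree models; measures how much each feature reduces impurity when used in splits.
PDPs, permutation importance: global and model-agnostic; a PDP averages the ICE curves over all samples, and permutation importance measures the drop in performance when one feature is shuffled.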
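Permutation importance is simple enough to sketch by hand. The version below assumes accuracy as the metric and uses a toy model that only looks at feature 0; all names and data are made up for illustration.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        X_perm = [
            row[:feature] + [v] + row[feature + 1:]
            for row, v in zip(X, column)
        ]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats

# Toy model: predicts 1 when feature 0 exceeds 0.5 and ignores feature 1.
model = lambda row: int(row[0] > 0.5)
X = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
y = [0, 1, 0, 1, 0, 1]

imp_used = permutation_importance(model, X, y, feature=0)
imp_unused = permutation_importance(model, X, y, feature=1)
```

Shuffling the feature the model ignores leaves accuracy unchanged, so its importance is zero, while shuffling the feature it relies on reduces accuracy.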
What makes medical time series data challenging?
High dimensionality, irregular sampling, noise, domain shifts.
What is time series data in medicine?
Sequential measurements like ECG, ICU monitors, lab tests over time.
What are time-aware neural models?
Neural network architectures designed to handle irregular, time-dependent data of the kind common in healthcare (e.g., vital signs, lab tests, ICU data).
Examples: T-LSTM, GRU-D, Neural ODEs, Temporal Fusion Transformers.
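To give a flavor of how such models use time gaps, here is a sketch of the input-decay idea from GRU-D: a missing value is filled with a mix of the last observation and the feature mean, and the weight on the last observation decays with the time since it was measured. In the real model the decay parameters are learned; here `w` and `b` are fixed, hypothetical values and the function name is made up.

```python
import math

def decay_impute(values, times, w=0.5, b=0.0):
    """Fill missing entries (None) with a time-decayed estimate.

    gamma = exp(-max(0, w * delta + b)), where delta is the time since
    the last observation. The filled value is
    gamma * last_value + (1 - gamma) * mean_of_observed_values.
    """
    observed = [v for v in values if v is not None]
    x_mean = sum(observed) / len(observed)
    filled = []
    last_value, last_time = x_mean, times[0]
    for v, t in zip(values, times):
        if v is not None:
            last_value, last_time = v, t
            filled.append(v)
        else:
            delta = t - last_time
            gamma = math.exp(-max(0.0, w * delta + b))
            filled.append(gamma * last_value + (1 - gamma) * x_mean)
    return filled

# Made-up heart-rate series with irregular sampling times (in hours).
heart_rate = [80.0, None, None, 120.0, None]
hours = [0.0, 1.0, 6.0, 7.0, 7.5]
filled = decay_impute(heart_rate, hours)
```

One hour after the reading of 80 the fill stays close to 80; six hours after, it has decayed almost all the way to the mean of 100.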
What is unstructured medical data?
Clinical notes, reports, and scanned forms; processing them requires NLP.
Name a generalist medical language model.
MedFound.
What is ClinicalBERT?
A BERT language model adapted to clinical text by further training on clinical notes.
What are NLP tasks you might need when working with unstructured medical data?
- Named Entity Recognition
- Relation Extraction
- Document Classification
- De-identification
- Summarization
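As a toy illustration of de-identification, the sketch below masks a few identifier formats with regular expressions. The patterns, the `MRN` format, and the example note are all assumptions; real de-identification systems typically combine rules like these with trained NER models and cover many more identifier types.

```python
import re

# Hypothetical identifier formats; real notes are far messier.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b"),
}

def deidentify(text):
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Seen on 2023-04-17, MRN: 884512. Call 555-867-5309 to follow up."
clean = deidentify(note)
# clean == "Seen on [DATE], [MRN]. Call [PHONE] to follow up."
```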