Lecture 7 Flashcards
1
Q
Natural Language Processing in Healthcare
A
- Manual annotation takes a long time and scales poorly
o Want to automatically generate annotations using semi-supervised learning
- Can make use of specific markers described in the report
o E.g. number of significant lesions -> extracted using regular expressions (see the sketch after this card) or predicted by NLP algorithms
- Standard pipeline
o Preprocessing: tokenisation, stemming, normalisation
o Transformation: indexing, featurising
o Mining: NLP, information extraction
- Text mining: recovering insights from unstructured data
o Does the report mention the presence of cancer?
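A minimal sketch of the regex route above, assuming hypothetical report sentences and phrasing patterns (none of these are from the lecture):

```python
import re

# Hypothetical report sentences; real radiology phrasing varies widely.
reports = [
    "There are 3 significant lesions in the left lobe.",
    "Two significant lesions were identified.",
    "No significant lesion is seen.",
    "Multiple lesions of significance were noted.",
]

# Match a digit or a small number word directly before "significant lesion(s)".
WORDS = {"no": 0, "one": 1, "two": 2, "three": 3}
pattern = re.compile(r"\b(\d+|no|one|two|three)\s+significant\s+lesions?\b",
                     re.IGNORECASE)

for text in reports:
    m = pattern.search(text)
    if m:
        token = m.group(1).lower()
        count = int(token) if token.isdigit() else WORDS[token]
        print(f"{count} significant lesion(s): {text!r}")
    else:
        print(f"no match: {text!r}")  # phrasing outside the pattern silently fails
```

The last sentence also illustrates the weakness flagged on the next card: any phrasing outside the pattern fails to match.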
2
Q
Classical Text Mining methods
A
- Regular expressions
o Especially useful for highly structured texts
o But often do not generalise well
- Exact/approximate string matching
o Implemented using some distance metric (both are computed in the sketch after this card)
▪ Jaccard similarity -> measures similarity of two sets, defined as intersection over union (IoU)
- "Obesity" = {O, b, e, s, i, t, y}, "Obese" = {O, b, e, s} as character sets
- Intersection: {O, b, e, s}, union: {O, b, e, s, i, t, y}, IoU = 4/7
▪ Levenshtein distance -> edit distance: in how many insertions, deletions and substitutions can we transform one word into another
- "Obesity" -> "Obesety" -> "Obeset" -> "Obese" = 3
o Threshold to determine a match
- Lacks contextual information
- Semantic similarity is not captured
- Quickly becomes complex
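A short sketch computing both metrics from this card; it reproduces the numbers above (4/7 for Jaccard on character sets, 3 edits for Levenshtein):

```python
# Jaccard similarity on character sets: |A ∩ B| / |A ∪ B|.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Levenshtein distance via dynamic programming: minimum number of
# insertions, deletions and substitutions to turn a into b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(jaccard("Obesity", "Obese"))      # 4/7 ≈ 0.571
print(levenshtein("Obesity", "Obese"))  # 3
```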
3
Q
Classical Machine Learning
A
- Aims to add semantic meaning
o Word2vec
o Similar words (e.g. male and female versions of a word) should have similar distances (see the sketch after this card)
- Problems with contextual information and unseen words
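A minimal gensim sketch of the word2vec idea, assuming gensim is installed; the toy corpus and the king/queen analogy are illustrative assumptions, and on such a tiny corpus the nearest neighbours are noisy:

```python
from gensim.models import Word2Vec

# Toy corpus (hypothetical); real word2vec models train on massive corpora.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "home"],
    ["the", "woman", "walks", "home"],
] * 50  # repeat so the toy model sees each word often enough

model = Word2Vec(sentences, vector_size=20, window=2, min_count=1,
                 workers=1, seed=0)

# Classic vector-offset analogy: king - man + woman ≈ queen.
# On a real corpus "queen" would rank near the top.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```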
4
Q
Modern Machine Learning Methods
A
- Typically Large Language Models
- Unsupervised training on massive corpora
- BERT: masked language modelling
o Predict a masked word given the context; cannot generate new data
- GPT: causal language modelling
o Predict the next word given the context; can generate new data autoregressively (see the sketch after this card)
- No problems with unseen words, adds contextual representations
- Can also be parallelised for training and inference
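A minimal Hugging Face transformers sketch contrasting the two objectives; the general-domain checkpoints (bert-base-uncased, gpt2) and the prompts are illustrative choices, not from the lecture:

```python
from transformers import pipeline

# Masked language modelling (BERT-style): fill in a masked token.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The scan shows a [MASK] in the left lung.")[:3])

# Causal language modelling (GPT-style): continue the text.
generate = pipeline("text-generation", model="gpt2")
print(generate("Findings: the chest X-ray shows", max_new_tokens=20))
```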
5
Q
Clinical NLP tasks
A
- Classification
o E.g. what is the diagnosis in the report?
- Regression
o E.g. what is the reported lesion size?
- Named entity recognition
o E.g. which specific areas are mentioned in the report? (see the sketch after this card)
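A short sketch of the NER task using a general-purpose checkpoint (dslim/bert-base-NER, an illustrative choice; a clinical deployment would use a model trained on biomedical text) on a hypothetical sentence:

```python
from transformers import pipeline

# Token classification = named entity recognition; aggregation merges
# sub-word tokens back into whole entity spans.
ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")
print(ner("CT of the thorax performed at University Hospital, Amsterdam."))
```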
6
Q
Large Language Models
A
- Pretraining: gather information from a huge corpus
- Finetuning: tune the output format to allow information extraction
o Requires:
▪ Labelled dataset of input/output pairs
▪ Model weights of a foundation LLM, or a finetuning API
o Update weights to minimise perplexity (see the sketch after this card)
▪ Measures how likely the model is to generate the input text sequence (lower = more likely)
- Reinforcement Learning from Human Feedback (RLHF): tune for human interaction/preferences
- Must be aware of what we upload -> medical data is highly sensitive
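A minimal sketch of measuring perplexity with transformers: it is the exponential of the mean next-token cross-entropy, so lower values mean the model finds the text more likely. The gpt2 checkpoint and the sentence are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The patient shows no significant lesions."  # hypothetical input
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # next-token cross-entropy over the sequence.
    loss = model(ids, labels=ids).loss
print(torch.exp(loss).item())  # perplexity
```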
7
Q
Vision Language Models
A
- Contrastive Language-Image Pre-training (CLIP): predict which of the N×N possible (image, text) pairs in one batch actually occurred
- Separate text and image encoders produce the embeddings
- Maximise cosine similarity between real pairs, minimise it for all other pairs
- Evaluation: zero-shot classification (see the sketch after this card)
o Use large pre-trained models that classify without being trained on the particular use case
- Use the name of each class as a potential text pair
o Predict the most probable (image, text) pair and use it as the predicted class
- Bias is present, because we can add custom classes and the model will find results that match the class
- Uses only global representations, but for medical images we may need subtle visual cues to distinguish between normal and abnormal
o Add attention to the network, jointly learn global and local representations
o Complement information from the full image and critical local regions of interest
- Temporal ambiguity: must be able to match an image with the report it belongs to
o A report may describe changes since the previous report, but this is not captured in a single image
o Solution: use the prior and current image and compare the report to multiple sequential images
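A minimal sketch of CLIP-style zero-shot classification with transformers, following the recipe on this card: one text prompt per class, then pick the most probable (image, text) pair. The checkpoint, class prompts and image path are illustrative assumptions:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One text prompt per candidate class (hypothetical classes).
classes = ["a chest X-ray with a lesion", "a chest X-ray without lesions"]
image = Image.open("scan.png")  # hypothetical local file

inputs = processor(text=classes, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image holds the scaled cosine similarities between the
    # image embedding and each text embedding.
    logits = model(**inputs).logits_per_image
probs = logits.softmax(dim=1)
print(dict(zip(classes, probs[0].tolist())))
```

Note how this also exposes the bias mentioned above: whatever classes we supply, the model returns a probability distribution over them, even if none of them fit the image.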