Starter: GenAI Flashcards

(100 cards)

1
Q

What does a Temperature value of T=1 effectively correspond to in generation?

Compared to 0 or higher values.

A

It leaves the Softmax output at its default scaling (the logits are divided by 1), so the model samples from the unmodified probability distribution: it is more likely than at T=0 to pick tokens that are not the most probable, but less random than at a higher temperature.
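A minimal sketch (plain Python, illustrative logits only, not tied to any particular model API) of how temperature rescales the softmax before sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to sampling probabilities, scaled by temperature."""
    if temperature == 0:  # T=0 is treated as greedy decoding: always pick the argmax
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # default scaling (T=1)
print(softmax_with_temperature(logits, 0.3))  # sharper: probability mass concentrates on the top token
print(softmax_with_temperature(logits, 2.0))  # flatter: more randomness when sampling
```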

2
Q

How can a data scientist control the number of tokens generated by an LLM?

A

Use the ‘Max Output Tokens’ parameter.

3
Q

What is the purpose of using a decoder in an LLM?

A

The decoder is responsible for generating the output sequence based on the encoded representation of the input sequence.

4
Q

What is the role of attention in an LLM?

A

Attention allows the decoder to focus on the most relevant parts of the input sequence when generating the output sequence, and allows the encoder to learn relationships between words at the conceptual level.

5
Q

How does attention work in an LLM?

A

Attention is a mechanism that allows the decoder to attend to different parts of the input sequence based on their importance for generating the current output token. This is achieved by calculating a set of attention weights that represent the attention level for each input token. The attention weights are then used to compute a weighted sum of the input tokens, which is then used to generate the current output token.

On the encoder, it helps to learn a representation of the related concepts with different attention heads learning different relationships.
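A small numpy sketch of the weighted-sum idea described above, in scaled dot-product form; the query/key/value matrices are random stand-ins:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every input token to every query token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # attention weights sum to 1 per query
    return weights @ V  # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional representations
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8): one context-aware vector per token
```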

6
Q

What is the impact of attention on the performance of an LLM?

A

Attention can significantly improve the performance of an LLM by allowing it to focus on the most relevant parts of the input sequence when generating the output sequence. This can lead to more accurate, coherent, and fluent outputs.

7
Q

What are some challenges associated with attention in LLMs?

A

Attention can be computationally expensive, and it can also be difficult to train effectively. Additionally, attention can sometimes lead to overfitting, which can make the model less robust to new data.

8
Q

What fundamental problem in sequence modeling (like RNNs) does the Transformer’s “Attention” mechanism solve?

A

Attention allows the model to directly weigh the relevance of any word in the input sequence when processing another word, regardless of their distance. This overcomes the difficulty RNNs had with capturing long-range dependencies effectively.

9
Q

Explain the typical journey of an input word through a Transformer.

A
  1. Tokenization: Word is broken into token(s).
  2. Embedding: Token converted to a numerical vector.
  3. Positional Encoding: Information about the token’s position in the sequence is added to the embedding.
  4. Multi-Head Self-Attention: The model calculates attention scores between this token and all other tokens in the input to create a context-aware representation.
  5. Feed-Forward Network: Further processing occurs via a fully connected network to refine the representation.
  6. Encoder-to-Decoder Handoff: The deep, context-aware representation is passed from the Encoder to the Decoder.
  7. Generation: The Decoder receives a start-of-sequence token and then iteratively generates the output sequence until it reaches the max output tokens or emits a stop token.
10
Q

Why is Positional Encoding necessary in Transformers?

A

Because the core Self-Attention mechanism doesn’t inherently consider the order of words. Positional Encodings inject this crucial sequence information into the token representations.
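A short sketch of the sinusoidal positional encodings from the original Transformer paper (one common scheme; learned positional embeddings are another option):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even embedding dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings so each position gets a unique, order-aware signature
embeddings = np.random.default_rng(0).normal(size=(10, 16))  # 10 tokens, 16-dim embeddings
inputs = embeddings + sinusoidal_positional_encoding(10, 16)
```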

11
Q

How does the Decoder use information from both the input (via Encoder) and its own previously generated output?

A

It uses Masked Self-Attention to consider the previously generated tokens and Cross-Attention to incorporate the contextual information from the Encoder’s final output representations. This combination allows it to predict the next most likely token based on both the input prompt and the output generated so far.

12
Q

What is Tokenization, and why is the choice of strategy important for an LLM?

A

Tokenization is breaking input text into smaller units (tokens) the model processes (e.g., words, sub-words). The strategy impacts the vocabulary size, handling of rare words, and ultimately, model performance and efficiency.

13
Q

Why are Embeddings a cornerstone of how LLMs process language?

A

Embeddings convert discrete tokens into dense numerical vectors where semantic relationships are captured geometrically (similar words have closer vectors). This allows the model to perform mathematical operations that reflect linguistic meaning.
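A tiny illustration of "semantic relationships captured geometrically", using made-up 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: 'cat' and 'dog' point in similar directions; 'car' does not
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.4])
car = np.array([0.1, 0.9, -0.5])

print(cosine_similarity(cat, dog))  # close to 1: semantically similar
print(cosine_similarity(cat, car))  # much lower: semantically distant
```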

14
Q

In the GenAI Project Lifecycle, why is “Define the use case” the critical first step for a data scientist?

A

Clearly defining the use case guides all subsequent decisions: selecting the right foundation model (general vs. specialized), choosing the adaptation strategy (prompting vs. fine-tuning), and defining relevant evaluation metrics.

15
Q

What are the three primary methods for adapting a base LLM for a specific task, post-selection?

A
  1. Prompt Engineering: Crafting effective prompts, potentially with examples (few-shot).
  2. Fine-tuning: Further training the model on a dataset specific to the target task.
  3. Aligning with Human Feedback: Using techniques like Reinforcement Learning from Human Feedback (RLHF) to steer the model towards desired behaviors (e.g., helpfulness, harmlessness). Evaluation follows adaptation.
16
Q

When should a data scientist consider fine-tuning an LLM versus relying solely on prompt engineering?

A

Consider fine-tuning when:
a) Prompt engineering (even few-shot) doesn’t yield sufficient performance.
b) The task requires deep domain-specific knowledge adaptation.
c) You have a suitable dataset and computational resources for training. Prompting is generally faster and less resource-intensive for simpler adaptations.

17
Q

What’s a key trade-off when choosing between a very large foundation model and a smaller, potentially domain-specific one?

A

Large models offer broad knowledge and strong zero/few-shot capabilities but are computationally expensive. Smaller models can be more efficient and potentially achieve higher performance on specific tasks (especially after fine-tuning) but lack the broad general world knowledge of larger models.

18
Q

As a data scientist, when is few-shot prompting likely more effective than zero-shot?

A

Use few-shot when the task requires specific formatting, complex reasoning, or is nuanced. Providing examples guides the model’s output more precisely than instructions alone, especially beneficial for less capable models.

19
Q

What does “In-context learning” (ICL) refer to in LLMs?

A

The model’s ability to learn how to perform a task based only on the examples provided within the prompt itself (zero-shot, single-shot, few-shot), without updating its internal weights.

20
Q

How can a data scientist adjust an LLM’s output to be more deterministic/focused versus more creative/diverse?

A

Top K: Samples from the K most probable tokens. Higher K means more creative but potentially less coherent output.

Top P: Samples from the smallest set of tokens whose cumulative probability exceeds P. Adapts better to the shape of the probability distribution, often preferred for balancing quality and diversity.

Temperature: T=0 always gives the same (greedy) output; T > 1 gives more than the standard amount of randomness/creativity.
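A rough sketch of Top-K and Top-P (nucleus) filtering over a toy next-token distribution; the tokens and probabilities are made up:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability exceeds p, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {tok: prob / total for tok, prob in kept.items()}

probs = {"cake": 0.5, "pie": 0.25, "soup": 0.15, "rock": 0.1}  # toy next-token distribution
print(top_k_filter(probs, 2))    # samples only from {'cake', 'pie'}
print(top_p_filter(probs, 0.8))  # keeps tokens until cumulative probability reaches 0.8
```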

21
Q

What are the potential consequences of setting the ‘Max Output Tokens’ parameter incorrectly?

A

Too low: the model’s response may be cut off mid-thought, incomplete, or unable to fulfill the prompt’s requirements. Too high: can lead to unnecessary computation or overly verbose, rambling outputs if the model doesn’t naturally conclude sooner.

22
Q

What does a Temperature value of T=0 effectively correspond to in generation?

A

It makes the Softmax output extremely sharp, essentially forcing the model to always pick the single token with the absolute highest probability, equivalent to Greedy sampling.

24
Q

What do Scaling Laws tell us? What paper proved this?

A

They show model performance improves with increases in dataset size, model parameters, or compute, following a power-law relationship.

OpenAI’s “Scaling Laws for Neural Language Models”.

25
What was the Chinchilla paper's key insight? Name of the paper?
Optimal performance isn't just about the biggest model; it's about balancing model size and training data size. Smaller models trained on more data can outperform larger, under-trained models. Suggests a ~20:1 token-to-parameter ratio may be compute-optimal. Paper: "Training Compute-Optimal Large Language Models".
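A quick back-of-the-envelope sketch of the ~20:1 heuristic; the model sizes below are just examples:

```python
def chinchilla_optimal_tokens(n_parameters, ratio=20):
    """Approximate compute-optimal training tokens under the ~20:1 token-to-parameter heuristic."""
    return n_parameters * ratio

for params in [8e9, 70e9, 175e9]:  # example model sizes in parameters
    tokens = chinchilla_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e12:.1f}T training tokens")
```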
26
When is further fine-tuning or a domain-specific model needed?
When the target domain has unique vocabulary or word meanings not well-represented in general foundation models (e.g., medicine, finance): legal jargon, medical jargon, internal company acronyms/program names.
27
What's a challenge with domain-specific fine-tuning?
Acquiring sufficient high-quality, specialized training data can be difficult, precisely because the domain's specific language is not commonly found elsewhere.
28
How are the high costs of training managed?
- Quantization: using lower-precision numbers for weights to save memory at training & inference.
- Distributed Training: using multiple GPUs in parallel (e.g., Distributed Data Parallel).
29
What's the overall flow of a GenAI project lifecycle?
It's an iterative process: Scope (Define Use Case) -> Select (Choose/Pre-train Model) -> Adapt & Align (Prompting, Fine-tuning, Evaluate) -> Application Integration (Deploy, Optimize). Expect to cycle between adapting and evaluating.
30
What are some important considerations when choosing a pre-trained model?
1. The architecture, which indicates what it might be good or bad at, e.g. decoders are good at language generation, while encoders are good at sentiment analysis.
2. The size of the model, which helps determine the cost of compute.
3. The training data, specifically whether it included what's needed to be good at the task you're asking it to do.
31
What are the three different transformer architectures? Break each down by: Name / Also Known As / How it Trains / Objective / Good For / Examples.
Encoder-only models:
- Known as: Auto-encoders
- How it trains: Masked Language Modeling - masks a single token & tries to predict it from the context, forming a bi-directional representation of the context
- Objective: uses a "de-noising" objective
- Good for: entity recognition, sentiment analysis, word classification
- Examples: BERT & RoBERTa

Decoder-only models:
- Known as: Auto-regressive
- How it trains: Causal Language Modeling - takes in a sequence & tries to predict only the next token (unidirectionally)
- Objective: predict the next token (full language modeling)
- Good for: language generation; zero-shot inference works well for lots of things at scale
- Examples: GPT, BLOOM

Encoder + Decoder models:
- Known as: Sequence-to-sequence models
- How it trains: Span Corruption - masks random sequences of tokens that are then replaced with a single "sentinel token"
- Objective: the decoder reconstructs the masked span auto-regressively
- Good for: translation, summarization, question answering; good where we have a body of text as both input & output
- Examples: BART, T5
32
What is quantization & what problem does it solve? How does it work?
A method of storing model weights with less precision to make model training & inference less computationally intensive. It works by projecting the 32-bit floats typically used to store model weights into a less precise storage format, like BFLOAT16.
- It is based on the range of the parameter values being stored.
- Most libraries now offer quantization-aware training frameworks that learn the quantization scaling factors during training.
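A minimal PyTorch illustration of the storage side of this idea - casting weights from float32 to bfloat16 halves their memory footprint (this shows only the precision trade-off, not a full quantization-aware training setup):

```python
import torch

weights_fp32 = torch.randn(1000, 1000, dtype=torch.float32)  # stand-in for a layer's weights
weights_bf16 = weights_fp32.to(torch.bfloat16)                # same range as fp32, fewer mantissa bits

print(weights_fp32.element_size() * weights_fp32.nelement())  # 4,000,000 bytes
print(weights_bf16.element_size() * weights_bf16.nelement())  # 2,000,000 bytes
print((weights_fp32 - weights_bf16.to(torch.float32)).abs().max())  # small rounding error introduced
```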
33
What is the core question related to compute-optimal models? What are the key factors?
How do we get the best model performance for the least compute, i.e. a compute-optimal model? Three key ways to improve model performance:
- More data
- More model parameters to make meaning from the data
- More compute for training (higher compute, more time training) - but this is usually the constraint we're optimizing around.
34
What problem does Instruction Fine-tuning solve? What is the largest risk of Instruction Fine-tuning? When would you use IFT?
IFT takes a model that has general world knowledge, usually from next-word prediction, and makes it better at following instructions for specific tasks, which is what these models are generally used for.
Largest risk - Catastrophic Forgetting: when a model, through fine-tuning, forgets crucial or significant information learned during pre-training.
Use IFT when in-context learning (ICL) does not help the model achieve sufficient task performance.
35
What is parameter efficient fine-tuning? What problem(s) does it solve?
PEFT freezes most or all of the existing model parameters and trains only a small subset (or a small number of added parameters) to tune the model to new tasks. This makes fine-tuning less computationally expensive & mitigates catastrophic forgetting, since most model weights are unchanged. LoRA (Low-Rank Adaptation) is a popular method for this.
36
What are the steps of IFT?
1. Get a pre-trained model
2. Assemble a dataset of prompts + completions, where the prompt = instruction + context & the completion = desired result
3. Split the dataset into Train, Validate, Test sets
4. Fine-tune the model on this dataset using a loss function (often with parallel distributed computation, PEFT, & quantization), tune hyperparameters with the validation set, & get model performance with the test set
Output: an Instruct Model
37
When people say fine-tuning, they almost always mean what kind of fine-tuning?
Instruction Fine-tuning
38
How many examples does it take to do effective fine-tuning on a single task? What is the primary risk of fine-tuning on a specific task?
1. 500-1,000
2. Catastrophic Forgetting, i.e. forgetting key information learned in pre-training & failing to be effective on other tasks
39
When does catastrophic forgetting matter? How do you mitigate it?
It matters if you need your LLM to be good at more than just the task you fine-tuned on, e.g. fine-tuned for sentiment analysis but losing the ability to do entity recognition well. Mitigations:
1. PEFT - parameter-efficient fine-tuning, which only changes some parameters associated with the instructions/tasks, or adds adaptive layers that are tuned instead
2. Fine-tune on multiple tasks at the same time: give a mix of instruction prompt + completion tasks in fine-tuning so the model retains a more general ability to complete tasks beyond a single instruction problem
40
What are the challenges & benefits of multi-task instruction fine-tuning?
Challenges: hard to assemble the datasets which often require 50-100k pairs. Benefits: a great way to improve performance after pre-training.
41
What are examples of models that have multi-task instruction fine-tuning? Describe the dataset? What’s the metaphor researchers used for this?
1. FLAN-T5 and FLAN-PaLM are models that have been fine-tuned with the FLAN dataset.
2. FLAN (Fine-tuned LAnguage Net) is the collection used for fine-tuning, built from:
- 473 datasets chosen from other models/papers
- 146 task categories
Example: SamSum - 16k messenger-like conversations with summaries
3. "The metaphorical dessert to the main course of pre-training"
42
What is a way to make instruction datasets more effective & what’s an example?
Instruction datasets can pair the same completion with multiple different instructions that mean the same thing. SamSum examples:
- "Can you summarize what was said in that conversation?"
- "Briefly summarize the dialogue."
- "What were the main points in that conversation?"
Etc.
43
What’s an important consideration for instruction data when doing additional fine-tuning?
Understand the limitations of the original datasets & create or find datasets that most closely match the task(s) the model will perform for your application. E.g. summarizing back-and-forth chats on a social media platform might be quite different from summarizing customer service chats for booking a hotel.
44
How does evaluating Language models differ from regular machine learning?
Regular ML is accurate predictions / total predictions. In language, single-word differences can still be "accurate", and single-word differences can also be completely wrong:
- The dog ran quickly to the store
- The dog ran to the store
- The dog didn't run from the store
We also use benchmarks for overall generalizable performance & metrics (BLEU & ROUGE) for iterative model evaluation, since LLMs are primarily used for more generalizable use cases.
45
What are some classical language modeling metrics and what are they used for?
ROUGE: Recall-Oriented Understudy for Gisting Evaluation - used in summarization; compares a summary to human reference summaries.
BLEU: Bilingual Evaluation Understudy - used in translation; compares to human-generated translations.
46
How should BLEU & ROUGE scores be used?
They are simple metrics used for iteration and diagnostic evaluation on translation & summarization tasks specifically. They shouldn’t be used to report on the overall model performance; that's what benchmarks are for.
47
What is a ROUGE score & how are calculations typically performed?
ROUGE: a suite of summarization metrics that stands for Recall-Oriented Understudy for Gisting Evaluation. The core idea is usually based on recall, measuring how much of the reference text is captured by the predicted text.
ROUGE-1 calculations:
- Recall: # of unigram matches in output / # of unigrams in reference
- Precision: # of unigram matches / # of unigrams in output
- F1: 2 × (precision × recall) / (precision + recall)
ROUGE-2 uses bigrams instead for the same calculations; scores will be lower.
ROUGE-L: instead of picking an n-gram size, use the longest common subsequence length shared by the reference & prediction, e.g. if two grams match then 2, but if three grams match then 3:
- It is cold outside
- It is very cold outside
- L = 2
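A bare-bones ROUGE-1 sketch following the formulas above; real evaluations typically use a library such as rouge-score, and this version ignores stemming and clipping:

```python
def rouge_1(reference, prediction):
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    matches = sum(1 for tok in pred_tokens if tok in ref_tokens)  # unigram matches (no clipping)
    recall = matches / len(ref_tokens)
    precision = matches / len(pred_tokens)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(rouge_1("It is cold outside", "It is very cold outside"))
# recall = 4/4, precision = 4/5, f1 ≈ 0.89
```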
48
How should ROUGE scores be considered across tasks? What are some limitations of ROUGE scores & how are they mitigated?
Different ROUGE calculations should be used for different tasks. Since it's just counting "matches", nonsense responses can be rated highly:
Reference: It is cold outside
Prediction: cold cold cold cold
This has perfect precision. It can be mitigated through clipping - limiting the number of unigram matches to the maximum number of times the word appears in the reference, i.e. "cold" appears 1 time in the reference, so precision drops from 1 to 1/4 = 0.25.
Still hard:
Reference: it is cold outside
Prediction: outside cold it is
This gets a deceivingly perfect score. As a result, you may need to experiment with different n-gram sizes for different tasks.
49
What generally happens to scores as the n-gram size used in an LLM metric increases?
The scores are generally lower.
50
How is the BLEU score calculated?
Average(precision across a range of n-gram sizes). It is precision-oriented, compared to ROUGE being recall-oriented. Core idea: on average, how many n-grams from the prediction appear in the reference?
Limitations / adjustments:
- Short predictions are favored, so a brevity penalty is introduced.
- Clipping is used, so tokens in the prediction can only be counted correct as many times as they appear in the reference.
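A simplified sketch matching the description above (an average of clipped n-gram precisions); note that the official BLEU combines the precisions with a geometric mean and applies the brevity penalty, both omitted here:

```python
from collections import Counter

def ngram_precision(reference, prediction, n):
    """Clipped n-gram precision: prediction n-grams only count up to their frequency in the reference."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    pred_counts = Counter(tuple(prediction[i:i + n]) for i in range(len(prediction) - n + 1))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return clipped / max(sum(pred_counts.values()), 1)

def simple_bleu(reference, prediction, max_n=4):
    """Average of clipped precisions for n = 1..max_n (brevity penalty omitted)."""
    return sum(ngram_precision(reference, prediction, n) for n in range(1, max_n + 1)) / max_n

ref = "it is cold outside today".split()
pred = "it is very cold outside".split()
print(simple_bleu(ref, pred))  # (0.8 + 0.5 + 0.0 + 0.0) / 4 = 0.325
```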
51
What is the role of benchmarks for LLMs?
To evaluate model performance. Since often simpler metrics like BLEU & ROUGE are not sufficient to assess overall model performance, benchmarks help to provide a better avenue to report on model performance across a wide range of tasks. Some benchmarks are designed to measure specific tasks.
52
What are some of the common benchmarks & what do they generally measure?
GLUE: General Language Understanding Evaluation
- Description: collection of NLP tasks (sentiment analysis, question answering) from 2018
- Measures: generalized model performance

SuperGLUE
- Description: successor to GLUE, launched in 2019
- Measures: some of GLUE's tasks, additional tasks, and more challenging versions of the same tasks (multi-sentence reasoning, reading comprehension)

MMLU: Massive Multitask Language Understanding
- Description: models must possess extensive world knowledge & problem-solving ability
- Measures: mathematics, law, US history, computer science, i.e. tasks beyond language understanding

BIG-bench
- Description: 204 tasks ranging from linguistics to biology to social bias to software engineering; comes in 3 different sizes

HELM: Holistic Evaluation of Language Models
- Description: aims to improve transparency of models & offer guidance on which models perform well for specific tasks
- Measures: 7 metrics across 16 scenarios
- Metrics: accuracy, calibration, robustness, fairness, bias, toxicity, efficiency
53
What are some considerations when running benchmarks for a model?
- The size of the benchmark, because it can incur significant inference costs; some benchmarks come in various sizes to help ensure researchers continue to have access to run evals on models (their own & industry models).
- The relevance of the benchmark to the task(s) you are attempting to have the model be good at (reasoning, risks, etc.).
- Whether the model has seen the evaluation data during training - if so, it's likely not a good measure of performance.
54
Describe precision & recall in terms of: Analogy / Phrasing ("of X, the Y") / Focus / Question / Formal Equation.
Precision:
- Analogy: I'm trying to catch tuna by fishing; of all of the fish I catch in my net, what % are tuna, i.e. how precise am I?
- Phrasing: "Of the things I predicted [as positive], the percent that are correct."
- Focus: the items you selected or predicted as positive.
- Question: how accurate were my positive predictions? (Minimizing False Positives)
- Formal equation: True Positives / (True Positives + False Positives)

Recall:
- Analogy: of all of the tuna in the lake, how many did I actually catch?
- Phrasing: "Of the things that are actually positive, the percent I predicted."
- Focus: the items that are actually positive in the whole dataset.
- Question: how many of the actual positives did I find? (Minimizing False Negatives)
- Formal equation: True Positives / (True Positives + False Negatives)
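A small worked example with made-up counts, matching the formal equations above:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)  # of my positive predictions, how many were right
    recall = true_positives / (true_positives + false_negatives)     # of the actual positives, how many did I find
    return precision, recall

# Hypothetical tuna-fishing haul: 8 tuna caught, 2 other fish caught, 4 tuna left in the lake
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=4)
print(p)  # 0.8   -> 80% of what I caught was tuna
print(r)  # ~0.67 -> I caught two-thirds of all the tuna
```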
55
What problem(s) does PEFT solve?
Fine-tuning an entire LLM is costly because it requires a lot of compute, and each different task fine-tuned on creates a large new version of the model to store & use for inference.
Parameter-efficient fine-tuning, by freezing at least some of the model weights (often 80-100%), makes it more efficient to perform fine-tuning & to keep various versions for inference on different tasks.
PEFT also helps prevent catastrophic forgetting because most model weights are kept frozen.
56
What are the tradeoffs to consider among PEFT methods (5)?
- Parameter efficiency: i.e. parameter-to-training-data ratio
- Memory efficiency: i.e. freezing more weights vs. reparameterizing more weights
- Model performance: i.e. forgetting
- Inference costs: i.e. adding weights at inference
- Training speed: i.e. changing more weights
57
What are the three main approaches to PEFT, their benefits/drawbacks, & any subdivisions within them? Paper?
Paper: “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning”
1. Selective: identify a subset of model weights for tuning & freeze the rest. Can select specific model components, different layers, or specific parameter types.
- Drawbacks: mixed results
2. Reparameterization: create new low-rank representations of the original network weights (e.g. LoRA).
- Benefits: doesn't increase the cost of inference
3. Additive: add new parameters to be trained while keeping the original model frozen. Two sub-types:
- Adapters: add new layers in the encoder/decoder after the attention or feed-forward layers.
- Soft prompting: keep the model fixed but add trainable prompt embeddings to the input, or keep the input fixed & retrain the input embeddings.
58
What is LoRA? How does LoRA work? What are some benefits & considerations of LoRA?
Low-Rank Adaptation is a popular fine-tuning technique classified as reparameterization.
1. The original model weights are frozen (attention & feed-forward layers).
2. Matrices that are much smaller (rank decomposition matrices), but multiply together to the same dimensions as the original parameters, are randomly initialized & then trained during fine-tuning in conjunction with the frozen weights.
3. At the end of training, the low-rank matrices are multiplied together to match the size of the frozen parameters & then integrated into the original parameters by addition: original parameters + (low-rank matrices multiplied together).
Benefits & considerations:
1. It's very efficient for fine-tuning, since often you only need to apply it to the attention layers (where most weights live in LLMs), though it can also be applied to the feed-forward layers.
2. You can often perform LoRA with a single GPU & avoid the need for distributed training.
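A rough PyTorch sketch of the LoRA idea for a single frozen linear layer; the dimensions and rank mirror the 512×64 example used later in this deck, and real projects typically use a library such as Hugging Face PEFT rather than hand-rolling this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad_(False)                               # 1. freeze the original weights
        in_dim, out_dim = base_layer.in_features, base_layer.out_features
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # 2. small trainable rank-decomposition matrices
        self.B = nn.Parameter(torch.zeros(out_dim, rank))         #    B starts at zero so the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # 3. frozen path + low-rank update (B @ A has the same shape as the frozen weight matrix)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 64), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8*512 + 64*8 = 4,608 trainable parameters vs. 512*64 = 32,768 frozen weights
```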
59
What is the Rank of a matrix? What is rank decomposition? What’s a good analogy for this?
The rank of a matrix essentially represents the number of linearly independent rows or columns it contains. It gives a sense of the "information content" within the matrix. A full-rank matrix has its maximum possible rank, while a low-rank matrix has a rank significantly lower than its dimensions.
Rank decomposition is the process of expressing a matrix as a product of two or more matrices with lower ranks. If you have a large matrix A, you can decompose it into matrices B and C such that A = BC. The ranks of B and C are typically lower than the rank of A, hence the term "low-rank decomposition."
Analogy:
- The matrix as a sign: think of a matrix as a large, detailed sign you want to create, with lots of intricate designs and elements.
- Rank as essential design elements: the "rank" of the matrix is like the number of truly essential design elements you need to create that sign. If the sign is just variations of a few basic shapes and colors, its "rank" is low - you don't need many unique instructions to create it. If it has a lot of unique details and complexity, its "rank" is high.
- Rank decomposition as using stencils: instead of drawing every detail by hand, you create a few simple stencils (the smaller, lower-rank matrices), each representing a basic pattern or element, then overlay and combine them in different ways to create the final complex sign.
In this analogy:
- The original, complex sign is the original matrix.
- The simple stencils are the rank decomposition matrices.
- The number of truly unique stencils you need is the rank of the original sign (matrix).
60
Given a transformer that is 512X64, you want to use LoRA with a rank of 8. What are the dimensions of the LoRA matrices?
Matrix A: 8×64; Matrix B: 512×8. That's 512 + 4,096 = 4,608 trainable parameters instead of the original 512 × 64 = 32,768 - roughly an 86% reduction in parameters to train.
61
If you want to use LoRA for multi-task fine-tuning, what consideration do you have? How does the performance of LoRA compare to something like a pre-trained only FLAN-T5 model and a fully fine-tuned version of that model?
You may want to fine-tune a separate set of LoRA matrices for each task (unique parameters), then switch them out at inference time based on the task being performed.
Performance: roughly 77% better than the pre-trained-only version (measured by ROUGE), and only about 3% worse than a fully fine-tuned version - a trade-off that is worth it in most cases.
62
How do you choose the rank of the LoRA matrices?
Still an area of active research, but generally there seems to be a cliff above r=16, so the recommendation is r between 4 & 32.
63
What is soft-prompting? How is it different from prompt engineering? How would you do it for multiple tasks?
A method of parameter-efficient fine-tuning that is "additive": additional tokens (20 to 100), with the same embedding length as other tokens, are included in the input prompt and randomly initialized. This is also known as prompt tuning. Their weights are learned over time while the rest of the model weights are kept frozen, allowing "tokens" that don't actually correspond to language to be used in instruction fine-tuning.
Prompt engineering manipulates the words of the input prompt, while soft prompting lets the algorithm learn representations that don't correspond to actual input tokens.
You can train different soft-prompt tokens for different tasks and then choose at inference time which set to use.
64
When is soft-prompting effective?
Soft prompting is effective when:
- Computational resources for full fine-tuning are not available or not worth the cost (it can be ~100X less expensive; soft prompting usually updates only 10k-100k parameters)
- The model is of sufficient size (typically >10 million parameters); it becomes on par with full fine-tuning above ~10 billion parameters
- The interpretability of the input tokens (which don't correspond to language) is not critical
65
How do you handle interpretability for soft-prompt tokens?
It’s difficult but the words with the closest representations within the space are generally considered to be similar concepts
66
What is QLoRA and why use it?
Quantized Low-Rank Adaptation: combines quantization with LoRA to further reduce the memory footprint of fine-tuning.
67
What approach do most people mean when they say “we will do PEFT for this model”?
Typically LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU)
68
What is RLHF? What is it primarily used for?
Reinforcement Learning from Human Feedback is used to help align models to human values after fine-tuning, and it is a key lever in the responsible AI toolkit.
69
What are some of the topics for using LLMs in applications beyond the training processes covered?
Using LLMs as reasoning engines that have access to tools (via APIs), like search, and can choose which tools to use and when, vs. just using LLMs for fact generation. RAG: retrieval-augmented generation, to ground the LLM with specific context based on what's been asked.
70
What are the three areas RLHF drives alignment & what values is it focused on?
Areas it mitigates: toxicity, dangerous information, aggressiveness.
Values it targets: honesty, helpfulness, harmlessness (HHH).
While fine-tuning often makes language more human-like, it might not make it honest, helpful, & harmless to the degree we need.
71
How does RLHF relate to fine-tuning & when does it occur? What’s the impact of it on performance?
RLHF is a type of fine-tuning but generally happens after instruction fine-tuning. It generally leads to better performance than a pre-trained-only or instruction fine-tuned model alone.
72
What are the key components of reinforcement learning? How does it progress over time?
Agent: the actor, which has a "policy" for how to navigate the environment and which it can update by taking an "action".
Environment: what can be modified by the agent. The environment has a state.
Reward Function: the way the agent understands whether changes to the environment were good or bad.
It progresses over time by taking somewhat random actions, which change the environment; the agent gets feedback via the reward function, which it then uses to update its policy so that future actions align better with the reward function. It starts out random & becomes more aligned through iteration.
73
What is a good example of RLHF? How does an LLM fit into this paradigm?
Tic-tac-toe:
- Agent: a policy on how to play the game
- Environment: the tic-tac-toe board
- Objective: win the game
- Reward: getting closer to winning the game

LLM:
- Agent: the instruct LLM, with the policy being the LLM itself
- Objective: generate aligned text
- Environment: the LLM context window where the prompt can be entered
- State: what's contained in the current context window
- Action: generating text, with the action space being the tokens in the vocabulary to choose from; how it generates text depends on the existing context & the probabilities it learned during training
- Reward: often human feedback or a derivation of human feedback - humans reviewing output for a specific measure (honest, harmless, helpful), OR a supervised model that's been trained on human feedback to provide it at scale, a.k.a. the reward model
74
In the context of language modeling the sequence of actions & states is called? Compared to the classical?
Rollout for LLMs Playouts classically
75
What are the process steps for humans to provide feedback for RLHF? What are considerations when choosing who should provide the feedback and how to set them up for success?
1. Choose an instruction fine-tuned model that's suitable for the task (often something that's been trained on multiple relevant tasks & has some world knowledge).
2. Have the model generate multiple completions for each prompt.
3. Choose what you will have people evaluate the model for (helpfulness, harmlessness, honesty).
4. Have multiple people rank the completions for each prompt against the evaluation criteria, with different people ranking the same completions.
You want people with diverse & representative skill sets & backgrounds, to ensure the reward model trained on their inputs covers your target measures in a well-rounded manner. The instructions they receive should be detailed & clear, otherwise raters may provide conflicting rankings on clear completions.
76
What are some important aspects of the instructions provided to people doing RLHF?
- Rank the completions.
- Assess based on X, Y. You can use Z tool.
- In case of a tie, do A.
- For nonsensical completions, do B.
77
Give an example of RLHF provided for the prompt “my house is too hot”.
The model might provide the following responses:
- "There is nothing you can do about hot houses"
- "You can cool your house with air conditioning"
- "It is not too hot"
Criterion: helpfulness
Rankings from 3 labelers:
- Option 1: 2, 2, 2
- Option 2: 1, 1, 3
- Option 3: 3, 3, 1
The third labeler probably misunderstood the instructions.
78
How is RLHF ranking data prepared for model training?
Ranks are turned into pairwise training data for the reward model: each completion is paired with every other completion, so N completions yield N(N-1)/2 distinct pairs (N choose 2). Within each pair, the preferred completion is assigned a 1 and placed first, & the non-preferred completion is assigned a 0 and placed second.
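A small sketch of turning one prompt's human rankings into pairwise reward-model training examples; the completions and ranks reuse the "my house is too hot" example above:

```python
from itertools import combinations

completions = {
    "There is nothing you can do about hot houses": 2,
    "You can cool your house with air conditioning": 1,
    "It is not too hot": 3,
}  # completion -> human rank (1 = most preferred)

pairs = []
for (text_a, rank_a), (text_b, rank_b) in combinations(completions.items(), 2):
    preferred, rejected = (text_a, text_b) if rank_a < rank_b else (text_b, text_a)
    pairs.append({"chosen": preferred, "rejected": rejected})  # chosen gets label 1, rejected gets 0

print(len(pairs))  # 3 completions -> 3 distinct pairs (N choose 2)
```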
79
What is the role of the reward model in the RLHF process? What is the model type for rewards models generally in LLMs?
To encode the feedback from humans and then take their place in the tuning process, providing that feedback at scale to tune the model. The reward model is usually itself a language model, like BERT.
80
How does the reward model relate to the reward function? What’s the role of logits?
The reward model takes the possible completions as input and outputs the preferred option via logits; during training it minimizes a loss based on the difference between the reward values of the preferred and non-preferred completions.
Logits are the non-normalized precursor to the binary class prediction of the reward model. The positive-class logit value is what's used in the reward function; it is a precursor to a probability (the transformation happens by applying a softmax).
81
What are the steps once you have your reward model to fine-tune an instruct model?
1. Pass a prompt to the instruct model to generate a completion.
2. The reward model evaluates the completion, with a more positive score being better.
3. The loss relative to the reward function is evaluated.
4. A reinforcement learning policy algorithm (Proximal Policy Optimization) feeds this back to the instruct model and adjusts its weights to tune it.
5. The updated instruct model generates a new, more closely aligned completion.
Repeat until a threshold is met or the maximum number of steps is reached.
82
What is the proximal policy optimization algorithm?
It is the reinforcement learning algorithm that, when paired with the reward-model loop, provides the policy (i.e. the LLM) with the updates required for its model weights to better align with the reward function.
83
What is reward hacking? What’s an example?
When a model undergoing RLHF noticeably degrades its performance on the original task by learning to provide responses that maximize the reward, even if its task-completion performance suffers.
Example: a model trained to provide product reviews undergoes RLHF for toxicity; it could go from saying:
- "This product is… [a dumpster fire]"
to
- "This product is… [really the most awesome product ever]"
The second completion generates a better reward even though the original task performance degrades.
84
How do you protect a model from reward hacking?
You keep a copy of the original model with frozen weights as a reference model & then compare the RLHF-updated model's completions against the reference model's outputs. The comparison uses the KL divergence metric, which is incorporated into the RLHF process and kept small so the model stays true to its original task.
85
What is KL Divergence? How does it work in RLHF? What’s a challenge & mitigation strategy for it?
Kullback-Leibler divergence - a statistical comparison of how different two probability distributions are. For an LLM it's calculated across all of the tokens in the vocabulary (after applying a softmax over the full vocabulary), which is quite computationally expensive, so it should be done on GPUs.
In RLHF, a KL divergence penalty is added to the reward function. The challenge is that it requires keeping two LLMs in memory (the frozen reference model and the model being updated); with PEFT, you often don't need two full copies and can instead keep one LLM in memory plus the modified adapter weights.
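A tiny numerical illustration of KL divergence between two toy next-token distributions; in RLHF it is computed over the full vocabulary for the updated model versus the frozen reference model:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

reference_probs = [0.5, 0.3, 0.2]  # frozen reference model's token probabilities (toy example)
updated_probs = [0.6, 0.3, 0.1]    # RLHF-updated model's probabilities for the same tokens

penalty = kl_divergence(updated_probs, reference_probs)
print(penalty)  # small value; grows as the updated model drifts from the reference
```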
86
What is a key challenge in scaling RLHF & what solution/research area has been proposed? What’s the process?
Challenge: gathering the thousands/millions of pieces of human feedback needed to train the reward model.
Proposed solution: reinforcement learning with AI feedback via self-supervision, based on Constitutional AI. Proposed by Anthropic in 2022, Constitutional AI gives an LLM a set of prompts (rules) that make up its "constitution", meant to help it govern values & make trade-offs among them, like prioritizing harmlessness over helpfulness. Paper: "Constitutional AI: Harmlessness from AI Feedback".
Part 1: supervised learning - fine-tune from self-critique & revised responses
1. Provide the LLM with its constitution.
2. A red team tries to get the model to elicit bad responses (like "How do I make a bomb?").
3. Have the model critique its own responses using the constitution.
4. Have the model regenerate the response based on its critique.
5. Fine-tune the model on the revised responses.
Part 2: ask the model which response is preferred, the original or the revised one, & use RL based on those preference pairs to encode the LLM's preferences.
87
What are three methods for optimizing LLMs for deployment? What general objective do these solve? How do each of these work?
Model Distillation, Post-Training Quantization, Model Pruning.
General objective: computational efficiency and storage for inference, by reducing model size.
- Model Distillation: the most popular approach. The fine-tuned LLM (teacher) has its weights frozen and creates predictions for the training dataset; a student LLM then creates its own predictions and is trained via a "distillation loss" to learn the knowledge of the teacher LLM.
- Post-Training Quantization (PTQ): similar to quantization-aware training, which converts model weights to a lower-precision format during training, PTQ does the same thing after training, for inference.
- Model Pruning: weights that are close to 0 are pruned from the model, resulting in a smaller model & often improved performance (though this is usually minimally effective for LLMs).
88
What are the detailed components of model distillation? What model type is distillation typically effective for?
- Use the teacher LLM with frozen weights to generate labels from the training data, a.k.a. soft labels.
- Use the student LLM to generate predictions from the training data, a.k.a. soft predictions.
- Knowledge distillation: the distillation loss function compares the probability distributions of teacher and student & tries to minimize the loss. A Temperature parameter can be set, where a higher value lets the student model learn to be more creative & a lower value (<1) lets it learn to mimic the teacher model more exactly.
- Since we also have the "ground truth" data from the original training dataset, the student model also learns to predict "hard predictions", which are compared against "hard labels"; here the T parameter always equals 1 and is not varied.
Distillation is mostly beneficial for encoder models, where a solid representation is generated, and not decoder-only models.
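A rough PyTorch sketch of the temperature-scaled distillation loss described above; the logits are random stand-ins, and a real setup would combine this with the hard-label loss at T=1:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)           # teacher's soft labels
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)   # student's soft predictions
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

teacher_logits = torch.randn(4, 32000)  # 4 examples, hypothetical 32k-token vocabulary
student_logits = torch.randn(4, 32000)
print(distillation_loss(student_logits, teacher_logits))
```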
89
What are some of the details of PTQ?
Post-Training Quantization:
- Takes 32-bit float representations and often converts them to 8 bits.
- Can be applied to both model weights & model activations (though activations tend to have a larger impact on performance).
90
What are the ways that pruning is applied? What’s the typical impact?
Pruning can be applied:
- During full training
- With PEFT/LoRA
- Post-training
It usually has low impact on LLMs since it only removes weights at or near 0, of which there are typically few.
91
What kinds of user queries do LLMs often struggle with that can be mitigated via clever orchestration in applications? How are these mitigated?
LLMs often struggle with:
- Out-of-date information
- Math - since the model is just predicting the next token
- Making up facts that aren't actually true
These can be mitigated via tool use to access:
- Up-to-date information
- The ability to complete mathematical operations
- Fact lookup via RAG, to ground responses in sources
92
What paper was published by Facebook in 2020 about RAG? What did it demonstrate & what was the setup?
"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks."
Setup: encode the user query → look it up against a vector database to retrieve relevant information → pass the relevant information plus the user query to the LLM for completion.
93
What does RAG help you mitigate? What considerations are there for using RAG?
Mitigates: knowledge cutoffs & hallucination.
Consideration: data must fit in the context window, so documents often need to be broken up into chunks for embedding.
94
What is the difference between a vector store & a vector database?
A vector database is just a vector store (i.e. text + vector representation) where there is also a unique key for each vector representation. This enables citation.
95
What does an LLM need to have success in an application setting?
- A plan: a set of actions/steps to follow.
- A specific format for its output: e.g. a SQL query.
- Validated actions: if it's taking actions, its outputs need to be validated by the user to ensure it's taking the correct actions.
96
What is Chain-of-Thought Prompting?
Via in-context learning, provide examples in which the reasoning steps used to reach the answer are written out. This primes the model to do the same when producing its own results, leading the LLM to behave more like a person when handling reasoning problems.
97
What is Program Aided Language Modeling? How does it work?
Using a program to enhance a language model's ability to provide precise answers that depend on mathematical operations.
- Via in-context learning, the model is shown chain-of-thought examples where the reasoning is a pairing of the problem breakdown (as comments) with the associated code.
- A new question is put at the end of the example prompt so that code steps are generated for it.
- The code steps are then passed to a Python interpreter for computation.
- The answer is then combined with the PAL-formatted prompt and passed back to the LLM so it can incorporate the answer in context.
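A minimal sketch of what a PAL-style prompt can look like; the example problems and variable names are illustrative, not taken from a specific implementation:

```python
# The chain of thought is written as comments paired with executable Python; the model is
# asked to continue the pattern for the new question, and the generated code is then run
# by a Python interpreter to get the exact answer.
pal_prompt = '''
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many does he have now?
# Roger started with 5 balls.
tennis_balls = 5
# 2 cans of 3 tennis balls each is 6 more balls.
bought_balls = 2 * 3
# The answer is the total.
answer = tennis_balls + bought_balls

Q: A bakery had 23 loaves, sold 15, and baked 7 more. How many loaves are left?
'''
```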
98
What research paper came up with ReAct & by who? What is it & what are it’s components?
“ReAct: Synergizing Reasoning and Acting in Language Models” - Princeton & Google.
A framework to help LLMs plan out & execute more complex workflows. It combines CoT reasoning with action planning. Components:
- Question: what the model is asked
- Thought: a reasoning step on how the model might address the question
- Action: choosing from a set of actions it could take (lookup, search, or finish)
- Observation: incorporating information from the Action back into the prompt
The Thought → Action → Observation sequence repeats until the model is confident it has the answer and chooses the "finish" action.
Benchmarks:
- HotPotQA: multi-step question answering from Wikipedia sources
- FEVER: uses Wikipedia passages to verify facts
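A heavily simplified sketch of the Thought → Action → Observation loop; call_llm and wikipedia_search are hypothetical stand-ins for an LLM call and a search tool, not real APIs:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that returns the next Thought/Action block."""
    raise NotImplementedError

def wikipedia_search(query: str) -> str:
    """Hypothetical stand-in for a search tool."""
    raise NotImplementedError

def react_loop(question: str, max_steps: int = 5) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(prompt)                   # model emits "Thought: ..." and "Action: ..."
        prompt += step + "\n"
        if "Action: Finish" in step:              # model is confident it has the answer
            return step.split("Action: Finish")[-1].strip("[] \n")
        if "Action: Search" in step:
            query = step.split("Action: Search")[-1].strip("[] \n")
            observation = wikipedia_search(query)  # tool result is fed back into the context
            prompt += f"Observation: {observation}\n"
    return "No answer within step budget"
```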
99
What does a set of instructions for the ReAct framework look like? And what is their relevance?
Do X by using Thought, Action, and Observation steps. Thought can reason about the current situation & Action can be one of the following:
1. First action [Lookup]
2. Second action [Search]
3. Final action [Finish]
Here are some examples…
Their relevance: they define the structure of the loop and constrain the model to a valid action space.
100
What is LangChain?
A framework that allows breaking LLM applications into modular components, making it easier to implement patterns like ReAct.