LLM Modeling Flashcards

1
Q

What model did we use in this class?

A

Flan-T5

2
Q

RNN

A

recurrent neural networks (previous generation) → each word attends only to the words before it

3
Q

LLM

A

large language models = relate all words to each other, with attention weights capturing the influence between the words

4
Q

Tokenize

A

Convert each word into a number (token ID); the IDs are stored in a vector
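As a minimal sketch of the idea, using a toy word-level vocabulary invented here (real tokenizers, such as the one used with Flan-T5, learn subword vocabularies with tens of thousands of entries):

```python
# Toy word-level tokenizer; the vocabulary and IDs below are made up
# for illustration (real tokenizers use learned subword vocabularies).
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "mat": 4, "on": 5}

def tokenize(text):
    """Convert each word into its integer ID; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 5, 1, 4]
```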

5
Q

Self Attention

A

analyzes the relationships between the tokens
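A minimal numeric sketch of scaled dot-product attention, simplified so that queries, keys, and values are all the raw token vectors (real models learn separate Q/K/V projections):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over token vectors X (seq_len, d).
    Simplification: Q = K = V = X; real models use learned projections."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # pairwise similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ X  # each token becomes a weighted mix of all tokens

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy token vectors
out = self_attention(X)
print(out.shape)  # (3, 2)
```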

6
Q

Encoder

A

takes in the prompt, builds contextual understanding of it, and outputs a vector

7
Q

Decoder

A

accepts input tokens and generates output tokens

8
Q

Sequence to Sequence

A

encoder-to-decoder model; translation, text summarization, and question answering are sequence-to-sequence tasks (T5, BART)

9
Q

Decoder-only model

A

good at generating text (GPT)

10
Q

Zero Shot Inference

A

pass no examples in the prompt, e.g. no graded sentiment examples (a type of in-context learning (ICL))

11
Q

One shot Inference

A

pass one example in the prompt, e.g. one graded sentiment example (a type of in-context learning (ICL))

12
Q

Few shot inference

A

pass a few examples in the prompt, e.g. several graded sentiment examples (a type of in-context learning (ICL))
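The three ICL variants differ only in how many worked examples go into the prompt. A sketch with made-up sentiment examples:

```python
# Made-up review/label pairs for illustration only.
examples = [
    ("I loved this movie!", "Positive"),
    ("Terrible acting and a dull plot.", "Negative"),
]

def build_prompt(query, n_shots):
    """n_shots = 0 -> zero-shot, 1 -> one-shot, >1 -> few-shot."""
    parts = [f"Review: {text}\nSentiment: {label}\n"
             for text, label in examples[:n_shots]]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n".join(parts)

print(build_prompt("What a waste of time.", n_shots=2))  # few-shot prompt
```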

13
Q

Greedy

A

always take the most probable word; the same prompt will produce the same output every time
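A sketch of a single greedy decoding step over a toy probability distribution (vocabulary and probabilities invented here):

```python
import numpy as np

def greedy_pick(probs, vocabulary):
    """Greedy decoding: always take the most probable token."""
    return vocabulary[int(np.argmax(probs))]

vocabulary = ["cat", "dog", "mat"]   # toy vocabulary
probs = np.array([0.2, 0.3, 0.5])    # model's next-token probabilities
# argmax is deterministic: the same distribution always yields the same token
print(greedy_pick(probs, vocabulary))  # mat
```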

14
Q

Big data

A

When the LLM is too big to fit on a single GPU

15
Q

DDP

A

Distributed Data Parallel

16
Q

Fully Sharded Data Parallel (FSDP)

A

BIGGER SCALE: reduces memory by distributing/sharding model parameters across GPUs

17
Q

Three main variables for scale

A

1) Constraints (GPU, time, cost) 2) Dataset size (number of tokens) 3) Model size (number of parameters)

18
Q

Chinchilla

A

very large models may be over-parameterized and under-trained → use fewer parameters but feed the model more data, versus making it bigger and bigger
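A rough sizing sketch using the roughly-20-training-tokens-per-parameter rule of thumb commonly quoted from the Chinchilla paper (the exact ratio is an approximation and is not stated in this deck):

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training-token count (~20x rule of thumb)."""
    return n_params * tokens_per_param

# A 70B-parameter model would want on the order of 1.4 trillion training
# tokens, rather than simply adding more and more parameters.
print(compute_optimal_tokens(70e9))
```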

19
Q

Fine Tuning an existing model

A

the output is a new model

20
Q

Specific Use Case Training

A

500 examples (via prompt-completion pairs) → updates all parameters → could lead to catastrophic forgetting (but that may not matter for a single-use-case implementation) → very compute intensive

21
Q

Multiple Use Case Training

A

1000s of examples across multiple tasks → updates all parameters → less likely to suffer catastrophic forgetting since it's trained across multiple tasks

22
Q

PEFT (Parameter Efficient Fine Tuning)

A

a small number of trainable parameters; the rest are frozen → MUCH MORE EFFICIENT

23
Q

Reparameterize model weights (LoRA)

A

Freezes the original model weights and injects small low-rank matrices; only the injected matrices are trained, and their product updates the frozen weights
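A numeric sketch of the LoRA reparameterization: the pretrained weight W stays frozen while two small matrices B and A (rank r much smaller than the model dimension) are trained; the dimensions and initialization below are illustrative:

```python
import numpy as np

d, r = 8, 2                    # model dim and low rank (r << d)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))    # frozen pretrained weight: never updated
A = rng.normal(size=(r, d))    # trainable, r x d
B = np.zeros((d, r))           # trainable, d x r (zero init: no change at start)

def forward(x):
    # Effective weight is W + B @ A; only B and A receive gradient updates.
    return x @ (W + B @ A).T

x = rng.normal(size=(d,))
# With B = 0 the adapted model initially matches the frozen model exactly.
assert np.allclose(forward(x), x @ W.T)
print(A.size + B.size, W.size)  # 32 64 -> half the trainable parameters here
```

At realistic scale the savings are far larger: for a 4096 x 4096 weight with r = 8, B and A together hold about 65K values versus 16.8M in W.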

24
Q

Additive

A

adds trainable layers or parameters to the model → KEEPS THE ENTIRE EXISTING MODEL frozen

25
PROMPT ENGINEERING
one-shot inference, etc.
26
PROMPT TUNING
Prompt Tuning: fine-tunes the LLM with structured data consisting of fields like "instruction" and "response". Fine Tuning: fine-tunes the LLM with unstructured data like raw text.
27
FLAN
Fine-Tuned Language Net: the specific instruction sets used to perform fine-tuning. FLAN-T5 and FLAN-PaLM are the instruction-tuned versions of the T5 and PaLM models.
28
ROUGE
used for text summarization; compares the output to one or more reference summaries
29
Recall
A very long response can have a recall of 100% but be too wordy
30
Precision
how many extra words are there in the output?
31
F1
The harmonic mean of recall and precision
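A toy ROUGE-1-style computation tying the three metrics together (real ROUGE implementations also handle stemming, multiple references, and n-grams beyond unigrams):

```python
from collections import Counter

def unigram_scores(candidate, reference):
    """Clipped unigram precision, recall, and their harmonic mean (F1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    precision = overlap / sum(cand.values())  # penalizes extra words
    recall = overlap / sum(ref.values())      # rewards covering the reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A wordy candidate: recall is a perfect 1.0 but precision drops to 0.5.
p, r, f1 = unigram_scores("the cat sat on the mat", "the cat sat")
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.5 1.0 0.67
```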
32
BLEU
used for text translation; compares to human generated translation
33
RLHF
Reinforcement Learning from Human Feedback: tuning a model to be helpful, honest, and harmless (the three H's, HHH). Humans rank how 'good' responses are by comparing several options (how helpful? how harmful? how honest?)
34
PPO
Proximal Policy Optimization: a popular algorithm for solving reinforcement learning problems. Makes updates to the LLM within a very small (proximal) region over many iterations to better align with HHH
35
Reward model
supervised learning that takes your human-ranked prompts and responses and 'rewards' the preferred ones (comparing class probabilities, e.g. hate vs. not-hate)
36
Reward Hacking
where a model tries to optimize its score by producing answers that are long and wordy. Avoid reward hacking by comparing to a frozen reference model via a KL-divergence shift penalty
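A sketch of the KL-divergence shift penalty over toy next-token distributions (the penalty weight and probabilities are invented for illustration):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

ref = np.array([0.5, 0.3, 0.2])       # frozen reference model's distribution
policy = np.array([0.45, 0.35, 0.2])  # updated policy model's distribution
beta = 0.1                            # penalty weight (assumed value)
raw_reward = 1.0                      # reward-model score (made up)

# Drifting away from the reference model costs reward, which discourages
# reward hacking such as padding answers with extra words.
penalized = raw_reward - beta * kl_divergence(policy, ref)
print(penalized < raw_reward)  # True
```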
37
Constitutional AI
allows you to scale reinforcement learning without human intervention. Constitutional AI (CAI) is similar to RLHF, except that instead of human feedback it learns through AI feedback.
38
LLM Optimization Techniques
Distillation: train a smaller student model from a larger teacher model. Post-Training Quantization (PTQ): reduce the precision of model weights (e.g. from 32 bits to 8 bits). Pruning: remove model weights with values close to or equal to 0 (in theory this reduces size, but in practice there may not be many weights at or near zero)
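A minimal post-training quantization sketch: a symmetric linear mapping of float32 weights to int8 with a single scale factor (real PTQ schemes add zero points, per-channel scales, and calibration data):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float weights to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 values back to approximate float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each weight now takes 1 byte instead of 4; reconstruction error is
# bounded by the scale of the quantization grid.
print(q.dtype, bool(np.abs(w - w_hat).max() <= scale))  # int8 True
```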
39
3 types of issues of models
1) Out of date 2) Bad at math (can't do calculations) 3) Hallucinations: guessing the answers it doesn't know
40
How to mitigate issues with models
RAG (retrieval-augmented generation): get the details directly from a DB/API, then pass them to the model. Chain of Thought → provide hints on how to break the problem into smaller parts [good for simple problems]. Program-Aided Language models (PAL): integrate with Python to do the math, with the model generating code for Python to run
41
Responsible AI - how to mitigate:
Toxicity [curating training data, training guardrail models, using a diverse group of human annotators], Hallucination [educate users/add disclaimers], Intellectual Property [not easy: machine 'unlearning', filtering/blocking]
42
Existing metrics to measure hallucination
1) ROUGE → compare to an expected result 2) Ask ChatGPT to grade 3) Probability checks
43
A new way to score LLM hallucinations (Galileo)
ChainPoll: pass results to a chain-of-thought model, which outputs a score and a logic path
45
What does GPT stand for?
Generative Pre-trained Transformer. GPTs are a family of neural network models that use the transformer architecture; they are a key advancement in artificial intelligence (AI), powering generative AI applications such as ChatGPT.