LLM Modeling Flashcards
What model did we use in the class?
Flan-T5
RNN
recurrent neural networks (previous generation) → each word only sees the words that come before it
LLM
large language models = every word attends to every other word, with attention weights capturing how much influence the words have on each other
Tokenize
Convert each word (or subword) into a number (token ID); the IDs are stored in a vector
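A minimal sketch of tokenization, assuming the Hugging Face transformers library and the public google/flan-t5-base checkpoint (the model family used in class):

```python
# Sketch: turn text into token IDs with a pretrained tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

text = "Translate English to German: Hello, world!"
ids = tokenizer(text).input_ids                 # each (sub)word becomes an integer ID
print(ids)                                      # vector of token IDs
print(tokenizer.convert_ids_to_tokens(ids))     # the subword pieces behind those IDs
```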
Self Attention
analyzes the relationships between the tokens
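A toy sketch of scaled dot-product self-attention in NumPy; real models add learned query/key/value projections and multiple heads, so treat this as illustration only:

```python
# Sketch: every token attends to every other token; the softmax weights
# are the "attention/influence" scores between words.
import numpy as np

def self_attention(x):
    # x: (seq_len, d) token embeddings; here Q = K = V = x for simplicity
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # token-to-token similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ x                               # each output is a weighted mix of all tokens

tokens = np.random.randn(5, 8)                       # 5 tokens, 8-dim embeddings
print(self_attention(tokens).shape)                  # (5, 8)
```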
Encoder
takes the input prompt, builds a contextual understanding of it, and outputs vectors (one per token)
Decoder
accepts input tokens and generates output tokens
Sequence to Sequence
encoder-to-decoder model; translation, text summarization, and question answering are sequence-to-sequence tasks (T5, BART)
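A minimal sketch of running a sequence-to-sequence (encoder-decoder) model, again assuming transformers and the google/flan-t5-base checkpoint:

```python
# Sketch: the encoder reads the prompt, the decoder generates the answer.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Summarize: The encoder builds a contextual representation; the decoder generates tokens."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```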
Decoder only model
good at generating text (GPT)
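A minimal sketch of a decoder-only model continuing text, assuming the public gpt2 checkpoint:

```python
# Sketch: a decoder-only model extends the input text token by token.
from transformers import AutoTokenizer, AutoModelForCausalLM

gpt_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt_model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = gpt_tokenizer("Large language models are", return_tensors="pt")
out = gpt_model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=gpt_tokenizer.eos_token_id)
print(gpt_tokenizer.decode(out[0], skip_special_tokens=True))
```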
Zero Shot Inference
pass no labeled examples (e.g., no graded sentiment examples) in the prompt; a type of in-context learning (ICL)
One shot Inference
pass one labeled example (e.g., one graded sentiment example) in the prompt; a type of in-context learning (ICL)
Few shot inference
pass a few labeled examples (e.g., a few graded sentiment examples) in the prompt; a type of in-context learning (ICL)
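A sketch of zero- vs. few-shot prompts for the sentiment example; the prompt text is illustrative, and both prompts go to the same frozen model:

```python
# Sketch: in-context learning changes only the prompt, not the model weights.
zero_shot = (
    "Classify the sentiment of this review: 'The movie was great.'\n"
    "Sentiment:"
)

few_shot = (
    "Review: 'I loved every minute.'\nSentiment: positive\n\n"
    "Review: 'A complete waste of time.'\nSentiment: negative\n\n"
    "Review: 'The movie was great.'\nSentiment:"
)
print(few_shot)   # the extra labeled examples are the only difference
```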
Greedy
always take the most probable next word, so repeated runs produce the same output over and over again
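A sketch contrasting greedy decoding with sampling, reusing the model and tokenizer from the Flan-T5 example above (parameter values are illustrative):

```python
# Sketch: greedy decoding is deterministic; sampling draws from the distribution.
inputs = tokenizer("Write a one-line story about a robot.", return_tensors="pt")

greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=30)   # always the top token
sampled_ids = model.generate(**inputs, do_sample=True, top_k=50,
                             temperature=0.9, max_new_tokens=30)            # random draw each step

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))    # identical on every run
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))   # varies run to run
```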
Big data
When the LLM is too big to fit on a single GPU
DDP
Distributed Data Parallel: a full copy of the model on every GPU, each processing a different slice of the data, with gradients synchronized between them
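A minimal sketch of wrapping a model in PyTorch DistributedDataParallel; it assumes the script is launched with torchrun (one process per GPU), and the small linear layer stands in for a real LLM:

```python
# Sketch: DDP keeps a full model copy on every GPU and all-reduces gradients.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                       # one process per GPU (via torchrun)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).to(local_rank)      # stand-in for a real LLM
ddp_model = DDP(model, device_ids=[local_rank])       # full copy per GPU; gradients synchronized
```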
Fully Sharded Data Parallel (FSDP)
BIGGER SCALE: reduces memory by distributing/sharding model parameters (plus gradients and optimizer state) across GPUs
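A minimal sketch of FSDP using the same torchrun/process-group setup as the DDP sketch above; the tiny linear layer again stands in for a real LLM:

```python
# Sketch: FSDP shards parameters, gradients, and optimizer state across GPUs
# instead of keeping a full replica on each one.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

fsdp_model = FSDP(torch.nn.Linear(512, 512).cuda())   # sharded across the process group
```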
Three main variables for scale
1) Compute constraints (GPUs, time, cost) 2) Dataset size (number of tokens) 3) Model size (number of parameters)
Chinchilla
very large models may be over-parameterized and under-trained → better to use fewer parameters and feed the model more data than to keep making it bigger and bigger
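A back-of-the-envelope sketch of the Chinchilla rule of thumb (roughly 20 training tokens per parameter for compute-optimal training):

```python
# Sketch: compute-optimal data budget for a given model size (~20 tokens/parameter).
params = 70e9                                     # e.g. a 70B-parameter model
optimal_tokens = 20 * params
print(f"~{optimal_tokens:.1e} training tokens")   # ~1.4e12, i.e. about 1.4 trillion
```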
Fine Tuning an existing model
the output is a new model
Specific Use Case Training
500 examples (prompt-completion pairs) → updates all parameters → could lead to catastrophic forgetting (which may not matter for a single-use-case implementation) → very compute intensive
Multiple Use Case Training
1,000s of examples across multiple tasks → updates all parameters → less likely to cause catastrophic forgetting since it is trained across multiple tasks
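A minimal sketch of full fine-tuning on prompt-completion pairs with the Hugging Face Trainer; the dataset rows and hyperparameters are illustrative only:

```python
# Sketch: full fine-tuning updates every parameter and outputs a new model checkpoint.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

pairs = Dataset.from_dict({
    "prompt": ["Classify the sentiment of this review: 'Great product.'"],
    "completion": ["positive"],
})

def tokenize(batch):
    enc = tokenizer(batch["prompt"], truncation=True)
    enc["labels"] = tokenizer(batch["completion"], truncation=True)["input_ids"]
    return enc

train = pairs.map(tokenize, batched=True, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="ft-out", num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()   # the fine-tuned weights saved to "ft-out" are the new model
```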
PEFT (Parameter Efficient Fine Tuning)
trains only a small number of parameters; the rest are frozen → MUCH MORE EFFICIENT
Reparameterize model weights (LoRA)
freezes most of the original weights; injects small low-rank matrices alongside them, and only these non-frozen injected matrices are trained to update the model
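A minimal sketch of LoRA with the Hugging Face peft library; the rank and target modules are illustrative choices for a T5-style model:

```python
# Sketch: the base weights stay frozen; only the injected low-rank matrices train.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                           # rank of the injected matrices
    lora_alpha=32,
    target_modules=["q", "v"],     # attention projections to adapt in T5
    lora_dropout=0.05,
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()   # a tiny fraction of the total parameters
```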
Additive
adds trainable layers or parameters to the model (e.g., adapters or soft prompts) → KEEPS THE ENTIRE EXISTING MODEL frozen
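A minimal sketch of an additive method (soft prompt tuning) with peft, reusing the base model from the LoRA sketch; only the added virtual-token embeddings are trained:

```python
# Sketch: the entire existing model is kept frozen; trainable prompt vectors are added.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

prompt_config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,                      # the added "soft prompt" parameters
    prompt_tuning_init=PromptTuningInit.RANDOM,
)
prompt_model = get_peft_model(base, prompt_config)   # `base` from the LoRA sketch above
prompt_model.print_trainable_parameters()
```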