Lecture 16 - Transformers and ChatGPT Flashcards

1
Q

Neural Networks again: Word embeddings

A

Example: figuring out how similar someone is to you in terms of personality
The Big 5:
Extraversion
Neuroticism
Openness to experience
Agreeableness
Conscientiousness

Example: I have 100/100 extraversion and 46/100 neuroticism
Person 1 has 100/100 extraversion and 25/100 neuroticism
Put the scores on a scale from -1 to +1 (this is the person embedding)
100 = 1, 46 = -0.08, 25 = -0.5
Next, plot the embeddings as vectors on a graph
Then compare how small the angle between the two vectors is to see how similar I am to Person 1 (cosine similarity)

Cosine similarity (angle between 2 vectors)
Example with personality scores
Lowest cosine distance = most similar
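
A minimal sketch (not from the slides) of this comparison in Python, using the trait scores from the example rescaled onto the -1…+1 embedding scale:

```python
import numpy as np

def rescale(score):
    """Map a 0-100 trait score onto the -1..+1 embedding scale."""
    return score / 50.0 - 1.0

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two-trait "person embeddings" from the example: (extraversion, neuroticism)
me       = np.array([rescale(100), rescale(46)])   # (1.0, -0.08)
person_1 = np.array([rescale(100), rescale(25)])   # (1.0, -0.5)

print(cosine_similarity(me, person_1))  # close to 1.0 -> small angle -> similar personalities
```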

We can do this with words too! Word Embeddings
Ex:
Put the word embeddings beside the word King
Word embeddings = Coloured values beside king = values between -1 and +1 (like the personality traits)
Same thing for words like man, woman
Figure out that
king - man + woman = approximately queen
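
As an illustration (with made-up 2-D embeddings; real word embeddings are learned and have hundreds of dimensions), the king - man + woman analogy can be checked with the same cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy, hand-made embeddings; the two dimensions roughly encode (royalty, gender)
emb = {
    "king":  np.array([0.95,  0.90]),
    "queen": np.array([0.95, -0.90]),
    "man":   np.array([0.05,  0.90]),
    "woman": np.array([0.05, -0.90]),
}

target = emb["king"] - emb["man"] + emb["woman"]

# Which word's embedding points in the most similar direction to the result?
best = max(emb, key=lambda w: cosine_similarity(emb[w], target))
print(best)  # -> "queen"
```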

Word embeddings are achieved with proxy tasks!!
Ex: ask the NN which emoji occurs the most
It will implicitly learn to count
*we never ask directly!!!

Continuous Bag of Words
Ex: The __ sat on the throne.
Network → queen
CBOW uses a supervised learning setup (the missing word is the label, taken from the text itself)
Learns word embeddings from huge numbers of examples!
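
A rough sketch of the CBOW idea, assuming a tiny hypothetical vocabulary and random (untrained) weights, just to show how the context words' embeddings are averaged and scored against every word:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "queen", "sat", "on", "throne", "king", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 8
E = rng.normal(size=(len(vocab), dim))   # input word embeddings (learned during training)
W = rng.normal(size=(dim, len(vocab)))   # output projection (also learned)

def cbow_predict(context_words):
    """CBOW proxy task: predict the missing word from the average of its context embeddings."""
    ctx = np.mean([E[word_to_id[w]] for w in context_words], axis=0)
    scores = ctx @ W                                  # one score per vocabulary word
    probs = np.exp(scores) / np.sum(np.exp(scores))   # softmax
    return vocab[int(np.argmax(probs))]

# "The __ sat on the throne." -> training pushes the blank toward "queen";
# with random weights here the output is arbitrary, but the shapes are the real ones.
print(cbow_predict(["the", "sat", "on", "the", "throne"]))
```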

Country and Capital Vectors
Since the vector between each country and its capital is roughly the same, they all capture the concept of "capital city"

Summary:
● Word embeddings
○ …make sure similar words have similar representation
○ Achieved via proxy task

2
Q

Language Prediction (Recap)

A

Letter to Letter
Word to Word
Word to Word to Word to…

How many words do we need to look at in English
to meaningfully predict the next word?

The more words you look at, the more conceptual questions you can answer

Ex: GPT-4 can be fed up to 24,000 words (about 48 pages) per prompt
Whereas GPT-3.5 only about 3,000 words (6 pages)

ChatGPT uses Tokens
Many common words map to a single token
Sequences of characters commonly found next to each other may be grouped together (one token). Ex: 1234567890
Tokens are parts of words
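
If OpenAI's open-source tiktoken tokenizer is installed, the character-to-token grouping can be inspected directly (a side demonstration, not part of the lecture):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by the GPT-3.5 / GPT-4 models

tokens = enc.encode("1234567890")
print(tokens)                                 # a handful of token ids, not ten separate digits
print([enc.decode([t]) for t in tokens])      # shows how the digits get grouped, e.g. "123", "456", ...
```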

3
Q

Transformers

A

Previously RNNs were used; the transformer is the newer method

Key-Value Storage
Key = title / concept; Value = body / explanation
Ex:
Key = sky, Value=blue
Key=Tabasco, Value=potato man

Query: Sky?
Answer: blue
Query: Tabasco?
Answer: potato man

Key-value storage is not a transformer yet, just the idea behind transformers

"The sky is…"
- Query: "sky is"
Look at the key-value storage:
– sky: blue
– tabasco: potato man
→ …blue
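
By itself, key-value storage is just a dictionary lookup; a minimal sketch:

```python
# Plain key-value storage: exact lookup only.
storage = {
    "sky": "blue",
    "Tabasco": "potato man",
}

query = "sky"
print(storage[query])   # -> "blue"

# A transformer does NOT require an exact match: it compares the query vector against
# every key vector (cosine similarity) and blends the values of the best matches.
```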

For transformers:
Convert the last word into a query, key, and value
- "is": what does the "is" refer to, and what is the answer?
Done with matrix multiplications
All different vectors
Cosine similarity between the query vector and the key vectors
"Blue" is stored in the value (yellow) of the key "sky" (the value = stored knowledge)
*see images
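
A toy single-head attention sketch with random projection matrices (an assumption for illustration; real transformers use scaled dot products rather than literal cosine similarity, but the comparison of query and key directions is the same idea):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

dim = 4
X = rng.normal(size=(3, dim))          # toy embeddings for the words "the", "sky", "is"

# Learned projection matrices (random here, just to show the mechanics)
W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))

Q = X @ W_q                            # a query vector per word
K = X @ W_k                            # a key vector per word
V = X @ W_v                            # a value vector per word ("stored knowledge")

q_is = Q[-1]                           # query for the last word, "is"
scores = K @ q_is / np.sqrt(dim)       # compare the query to every key (scaled dot product)
weights = softmax(scores)              # attention weights, sum to 1
answer = weights @ V                   # blend the values of the best-matching keys

print(weights)    # which earlier words "is" attends to
print(answer)     # the retrieved information, used to predict the next word ("blue")
```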

Summary:
● Transformers
○ Based on key-value database idea (key = index, value = information)
○ Convert each word into a query that gets compared to all words' keys → retrieve the best value

4
Q

ChatGPT: AI vs. AGI

A

Narrow/weak AI
= good at one task (e.g., a chess robot: good at chess but not at other things)

Artificial General Intelligence (AGI)
= Good at all tasks
Can learn new tasks
Can do anything a human can, and better
- ChatGPT is the closest thing to AGI (we don't have AGI yet)

First, Unsupervised Pre-training
-Expensive training on massive datasets
300 billion tokens of text
Objective: predict the next word
Trained to give the correct output
Before: untrained GPT-3 with random values (weights)
After: GPT-3 with appropriate values

How does GPT predict?
● Look up word (token) embedding
● Calculate query vector
● Calculate key/value vectors for all other words
● Take value of best match, convert it to output
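
GPT-3 itself is not openly downloadable, but the same look-up-embedding / predict-next-token loop can be tried with the open GPT-2 model via the Hugging Face transformers library (an assumption of this sketch, not something the lecture uses):

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as an openly available stand-in for GPT-3 (same architecture family, much smaller)
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The sky is", return_tensors="pt").input_ids   # look up token ids / embeddings
out = model.generate(ids, max_new_tokens=5)              # repeatedly predict the next token
print(tok.decode(out[0]))
```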

GPT = Generative Pre-trained Transformer

System prompt
The system prompt is secretly there before the user's request
It instructs ChatGPT
System: You are a large language model…
User: What’s the name of the furry animal with 4 legs?
Model: I think you mean a dog
User: No, the other one
Model: Oh, a cat maybe?
User: That’s the one! Thanks

This is how it remembers the conversation!
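
A hypothetical sketch of that memory mechanism: the model itself is stateless, so on every turn the whole transcript (system prompt included) is flattened into one prompt and resent; generate_reply stands in for the actual model call:

```python
# Hypothetical sketch: "memory" = resending the whole conversation each turn.
conversation = [
    ("system", "You are a large language model..."),   # hidden system prompt
    ("user",   "What's the name of the furry animal with 4 legs?"),
]

def build_prompt(turns):
    """Flatten the whole conversation so far into one prompt string."""
    return "\n".join(f"{role}: {text}" for role, text in turns) + "\nmodel:"

def generate_reply(prompt):
    """Stand-in for the actual model call (e.g. an API request)."""
    ...

# Every turn: append the user message, resend EVERYTHING, append the model's answer.
reply = generate_reply(build_prompt(conversation))      # real model: "I think you mean a dog"
conversation.append(("model", reply))
conversation.append(("user", "No, the other one"))
reply = generate_reply(build_prompt(conversation))      # real model: "Oh, a cat maybe?"
```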

How to remove the system prompt:
1) Pay and use the API
2) Look up "ChatGPT Jailbreak Prompt", e.g. https://docs.kanaries.net/articles/chatgpt-jailbreak-prompt

Datasets matter!
Wikipedia
CommonCrawl + RefinedWeb
Reddit, Youtube, Twitter?
Textbooks?
Problems?
Could be very biased
If it incorporates the whole internet, it will include some dodgy things

Uses Reinforcement Learning from Human Feedback (RLHF)
- Example: giving a like (thumbs-up) to a response
Problem: sparse reward: we only have rewards for the sentences that were actually tried out

5
Q

ChatGPT: Reward (preference) model and Mixture of Experts

A

Reward (preference) model: its purpose is to estimate how well a human would like the answer (it does not predict the next word)
- Adjusting for human preference is how the model gets better
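
One common way to train such a reward model (the pairwise preference loss used in InstructGPT-style RLHF; sketched here as an assumption, not taken from the slides) is to push the score of the human-preferred answer above the rejected one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise (Bradley-Terry style) loss: prefer a higher score for the chosen answer."""
    return -np.log(sigmoid(reward_chosen - reward_rejected))

# If the reward model already scores the preferred answer higher, the loss is small:
print(preference_loss(reward_chosen=2.0, reward_rejected=0.5))   # ~0.20
print(preference_loss(reward_chosen=0.5, reward_rejected=2.0))   # ~1.70, model needs adjusting
```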

Mixture of Experts
- Train 16 different expert models
Input → Experts 1-16 → Gating network → Output
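
A minimal sketch of the mixture-of-experts idea with random stand-in experts: a gating network weights each expert's output, and in practice only the top-scoring experts are actually run:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_experts, dim = 16, 32
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]  # 16 "experts" (random matrices here)
gate_W = rng.normal(size=(dim, n_experts))                         # gating network weights

x = rng.normal(size=dim)                        # one input
gate = softmax(x @ gate_W)                      # how much the gating network trusts each expert
outputs = np.stack([x @ E for E in experts])    # each expert's answer
y = gate @ outputs                              # weighted mixture = final output

print(gate.round(2))   # in real systems, only the top-k experts are actually evaluated
```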

Context summarization
8000 tokens → 8001st token ✅
8001 tokens… ❌
Solution: summarize!
8000 tokens → 1000 token summary.
Anna Karenina:
350,000 words,
860 pages
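
A pseudocode-style sketch of the summarization workaround, with hypothetical count_tokens and summarize helpers standing in for the real tokenizer and a summarization call:

```python
CONTEXT_LIMIT = 8000    # tokens the model can attend to
SUMMARY_BUDGET = 1000   # tokens the summary may take up

def count_tokens(text):
    """Stand-in for a real tokenizer's token count."""
    ...

def summarize(text, max_tokens):
    """Stand-in for asking the model itself to compress the text."""
    ...

def fit_into_context(history, new_message):
    """If the conversation no longer fits, squash the old part into a ~1000-token summary."""
    if count_tokens(history + new_message) > CONTEXT_LIMIT:
        history = summarize(history, max_tokens=SUMMARY_BUDGET)
    return history + new_message
```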

Why is it so good?
Money!
$63 million USD
● Dataset procurement 💰
● Dataset storage 💰💰
● Initial training 💰💰💰💰💰💰💰💰 x 16
● RLHF 💰💰💰💰💰💰💰
● Fine-tuning 💰💰💰💰 x 16
● Inference 💰💰💰💰💰💰💰

6
Q

Summary

A

● Word embeddings
○ …make sure similar words have similar representation
○ Achieved via proxy task
● Transformers
○ Based on key-value database idea (key = index, value = information)
○ Convert each word into a query that gets compared to all words' keys → retrieve the best value
● ChatGPT
○ “Just a large transformer” - based on GPT3/4 (“Generative Pre-trained Transformer”)
○ Trained to predict next word based on 8000 token context
○ Fine-tuned based on human feedback with RLHF (approximate human reward model)
○ Mixture of 16 experts
○ Uses proprietary blend of 11 herbs & spices (datasets, textbooks, Reddit/YT scrapes, etc.)
