Large Language Models Flashcards
(27 cards)
How are people at predicting information based on faces?
Election example
-People were exposed to two faces for a fraction of a second and asked to predict which one would win a political election
-It doesn't matter how long you look at the face
-Results were similar for 100 ms, 250 ms, and unlimited time
-About 55% of participants guessed correctly, even from only a fraction of a second of exposure
A similar experiment asked who was more likely to be promoted to general; there was a positive correlation between facial dominance score and the percentage of people who thought the person would become a general
What is a loss function designed for in an LLM?
for a specific task
What is the primary critique of AI?
Many AI models are designed for one specific application, although the media makes it seem very close to human cognition (it is still very far)
When was the first time AI wasn’t designed for one thing?
ChatGPT
Artificial General Intelligence
A very general mental capability that involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas
What is a foundation model?
-Not trained to do one thing; it is trained to act like AGI
-Trained with one loss function applied across all of these tasks
What is the master loss function for AGI and how do you build it?
Next word prediction; built by turning words into numbers
What is the input and output in an LLM?
The beginning of a sentence is the input; the word the model predicts is the output
Explain the temperature analogy for LLMs
Low temperature means the model behaves more deterministically (it favors the most probable words)
High temperature means the model behaves very randomly
If the temperature is 0, meaning no randomness, the model will give the same answer to the same prompt every time
What are the neurons in an LLM?
Each neuron corresponds to one word in the dictionary
Explain how temperature affects the LLM
-User sets the temperature in the code
-Model gives a probability for each candidate next word
-If temp > 0, meaning there is some randomness, the model will generally pick words with high probability, but low-probability words also have a chance
-There is an upper limit on the temperature; at the highest setting, the model just spits out random words (see the sketch below)
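A minimal Python sketch of how temperature reshapes the next-word probabilities before sampling; the toy vocabulary, scores, and the sample_next_word helper are made up for illustration, not an actual model.

```python
import numpy as np

def sample_next_word(logits, vocab, temperature, rng):
    """Turn raw scores (logits) into probabilities and pick one word."""
    if temperature == 0:
        # No randomness: always take the highest-scoring word.
        return vocab[int(np.argmax(logits))]
    # Dividing by a small temperature sharpens the distribution (more deterministic);
    # dividing by a large temperature flattens it (more random).
    probs = np.exp(logits / temperature)
    probs = probs / probs.sum()
    return rng.choice(vocab, p=probs)

vocab = ["mat", "roof", "moon", "banana"]        # toy dictionary
logits = np.array([3.0, 2.0, 0.5, -1.0])         # scores for "The cat sat on the ..."
rng = np.random.default_rng(0)

print(sample_next_word(logits, vocab, 0, rng))                         # always "mat"
print([sample_next_word(logits, vocab, 0.7, rng) for _ in range(5)])   # mostly "mat"
print([sample_next_word(logits, vocab, 5.0, rng) for _ in range(5)])   # close to random
```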
Explain how an LLM works
The first word is generated from the prompt, then the text is sent through the model again; after each new word is added, the response is sent back through the model
E.g. if it gives you an answer of 10,000 words, the response was run through the model 10,000 times
It bypasses how we acquire knowledge and goes directly to the outcome of our collective intelligence
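A minimal sketch of this word-by-word generation loop; predict_next_word is a hypothetical stand-in for the trained network, only to show that the growing text is fed back through the model once per new word.

```python
# Hypothetical stand-in for the trained network: maps the text so far to one next word.
def predict_next_word(text_so_far):
    canned = {
        "Roses are": "red,",
        "Roses are red,": "violets",
        "Roses are red, violets": "are",
    }
    return canned.get(text_so_far, "blue.")

def generate(prompt, n_words):
    text = prompt
    for _ in range(n_words):           # one full pass through the "model" per word
        next_word = predict_next_word(text)
        text = text + " " + next_word  # append the word and feed the longer text back in
    return text

print(generate("Roses are", 4))        # Roses are red, violets are blue.
```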
Explain how an LLM uses vectors.
-Each word is converted into a vector (each word is represented by a long list of zeros, except for one spot that is a 1; the position of the 1 corresponds to the word)
-Like an image is made of pixels, the entire sentence is just a stack of vectors, using the one-hot representation
-The one-hot representation is the input and is condensed into a much shorter code (this code, in the middle of the network, is the embedding)
-Then you make a prediction: when looking at a sentence, you look at the words before and after the highlighted word, which gives you a pair of data; given the current word, what should the previous word be?
If the dictionary contained 500,000 words, there would be 499,999 zeros in the vector (see the sketch below)
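A minimal sketch of the one-hot representation, assuming a made-up five-word dictionary; the embedding matrix here is random, just to show how a long one-hot vector is condensed into a much shorter code.

```python
import numpy as np

# Toy dictionary; a real one could have hundreds of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's position in the dictionary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Like pixels in an image, a sentence is just a stack of these vectors.
sentence = ["the", "cat", "sat"]
print(np.stack([one_hot(w) for w in sentence]))

# A (here randomly initialised) embedding matrix condenses each long one-hot
# vector into a much shorter code, e.g. 2 numbers instead of len(vocab).
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), 2))
print(one_hot("cat") @ embedding_matrix)   # the short "code" for "cat"
```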
*How are LLMs trained?
It gathers its data from all the sentences available to it (e.g. journal articles, news articles)
This is the training data and the ground truth
You ask the model to make a prediction on the next word and then compare it to the ground truth
Once trained, the neural network transforms each one-hot representation into a code
This is unsupervised learning because the data itself is its own label (see the sketch below)
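A minimal sketch of how (context, next word) training pairs fall out of the text itself; the two-sentence corpus is invented, and the actual network, loss, and optimizer are omitted.

```python
# Every sentence in the corpus yields (context, next word) pairs "for free":
# the ground-truth label is simply the word that actually comes next.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

pairs = []
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        pairs.append((words[:i], words[i]))   # (context so far, true next word)

print(pairs[0])   # (['the'], 'cat')
print(pairs[3])   # (['the', 'cat', 'sat', 'on'], 'the')

# Training then means: the model predicts the next word from the context,
# the prediction is compared to the ground-truth word, and the weights are
# adjusted to reduce the loss; no human labelling is needed.
```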
*What is the input layer in an LLM?
One word's one-hot representation is the input layer, not multiple words
i.e. this is what gets condensed into the embedding
What is the output in an LLM?
The probability of each word based on the training data
Will adjectives with similar meanings have a similar code?
Yes, the code will be similar although the input layer will be very different
E.g. "tired" and "exhausted" appear completely different in the input layer and have different one-hot representations, but their codes will be very similar because they are surrounded by similar words in sentences (see the sketch below)
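A minimal sketch comparing codes with cosine similarity; the numbers for "tired", "exhausted", and "banana" are invented to illustrate the idea and are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Close to 1 when two codes point in the same direction; lower when they differ."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented 3-number "codes" (real embeddings have hundreds of dimensions).
codes = {
    "tired":     np.array([0.9, 0.1, -0.4]),
    "exhausted": np.array([0.8, 0.2, -0.5]),
    "banana":    np.array([-0.3, 0.9, 0.6]),
}

print(cosine_similarity(codes["tired"], codes["exhausted"]))  # high (~0.98): similar contexts
print(cosine_similarity(codes["tired"], codes["banana"]))     # low (negative here): different contexts
```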
Vector
an arrow starting from the origin, pointing to the word's coordinates
-Traditional AI would search the dataset, and the efficiency of the AI was based on how fast it could search. Now, instead of searching through the dataset, you need embeddings of the vectors
This is how generative AI views language
Basics of vector math
-you can subtract two vectors to get another vector
-e.g. C - A = B
e.g. China - Beijing = Russia - Moscow
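A minimal sketch of this vector arithmetic with toy 2-D coordinates chosen so the offsets line up; real embeddings learn such regularities from text, in many more dimensions.

```python
import numpy as np

# Toy 2-D embeddings chosen so the country-minus-capital offset matches.
china,  beijing = np.array([5.0, 2.0]), np.array([4.0, 1.0])
russia, moscow  = np.array([6.0, 3.0]), np.array([5.0, 2.0])

print(china - beijing)             # [1. 1.]
print(russia - moscow)             # [1. 1.]  -> same offset: China - Beijing = Russia - Moscow

# Solving the analogy: which vector relates to Beijing as Russia relates to Moscow?
print(russia - moscow + beijing)   # [5. 2.] == china
```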
How is the position of the vector determined?
by its embedding (its coordinates, e.g. x and y in two dimensions)
What was the original purpose of using language in neural networks?
For computer vision
Language, vectors, and bias
The patterns of how words are associated with each other reveal a lot about human society
The bias is not in the research or the algorithm, it’s in the use of our language
E.g. computer programmer - man + woman = homemaker
How does an LLM process sentences?
When the LLM processes a prompt, it sees the entire prompt at once; it is not like human reading, where we process sentences word by word
*Makes the computation much faster
*The complexity of a document doesn't make a difference to an LLM as long as the documents are the same length (e.g. a children's book and a quantum physics textbook of equal length are processed the same)
What is a qualitative difference between human thinking and next word prediction?
AI prioritizes satisfying the request at the expense of making sense
E.g. Prompt: write a poem where the last sentence is the first sentence but with the words in reverse order