Starter: GenAI Flashcards
(100 cards)
What does a Temperature value of T=1 effectively correspond to in generation?
Compared to 0 or higher values.
It leaves the Softmax output at its default scaling (the logits are divided by 1), so the model samples from its unmodified probability distribution: it is more likely to pick tokens other than the single most probable one than at T=0, but less random than at higher temperatures.
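A minimal NumPy sketch of temperature scaling (the logits are hypothetical, purely for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax: T=1 leaves the distribution unchanged,
    T<1 sharpens it toward the top token, T>1 flattens it."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()              # numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5])              # hypothetical next-token logits
print(softmax_with_temperature(logits, 1.0))    # default scaling
print(softmax_with_temperature(logits, 0.1))    # near-greedy: mass piles onto the top token
print(softmax_with_temperature(logits, 2.0))    # flatter: more random sampling
```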
How can a data scientist control the number of tokens generated by an LLM?
Use the ‘Max Output Tokens’ parameter.
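A minimal sketch of capping output length, assuming the Hugging Face transformers library with a GPT-2 checkpoint (other APIs expose an equivalent parameter, often called max_tokens):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Generative AI is", return_tensors="pt")
# max_new_tokens caps how many tokens the model is allowed to generate
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```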
What is the purpose of using a decoder in an LLM?
The decoder is responsible for generating the output sequence based on the encoded representation of the input sequence.
What is the role of attention in an LLM?
Attention allows the decoder to focus on the most relevant parts of the input sequence when generating the output sequence, and allows the encoder to learn relationships between words at the conceptual level.
How does attention work in an LLM?
Attention is a mechanism that allows the decoder to attend to different parts of the input sequence based on their importance for generating the current output token. This is achieved by calculating a set of attention weights that represent the attention level for each input token. The attention weights are then used to compute a weighted sum of the input tokens, which is then used to generate the current output token.
On the encoder side, self-attention helps build a representation of related concepts, with different attention heads learning different kinds of relationships.
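A minimal NumPy sketch of single-head scaled dot-product attention (the Q/K/V matrices are random toy values, not taken from a real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # attention weights (softmax)
    return weights @ V, weights                            # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))      # 3 tokens, d_k = 4
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights)   # each row sums to 1: how strongly each token attends to the others
```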
What is the impact of attention on the performance of an LLM?
Attention can significantly improve the performance of an LLM by allowing it to focus on the most relevant parts of the input sequence when generating the output sequence. This can lead to more accurate, coherent, and fluent outputs.
What are some challenges associated with attention in LLMs?
Attention can be computationally expensive (its cost grows quadratically with sequence length), and it can also be difficult to train effectively. Additionally, attention can sometimes lead to overfitting, which can make the model less robust to new data.
What fundamental problem in sequence modeling (like RNNs) does the Transformer’s “Attention” mechanism solve?
Attention allows the model to directly weigh the relevance of any word in the input sequence when processing another word, regardless of their distance. This overcomes the difficulty RNNs had with capturing long-range dependencies effectively.
Explain the typical journey of an input word through a Transformer.
- Tokenization: Word is broken into token(s).
- Embedding: Token converted to a numerical vector.
- Positional Encoding: Information about the token’s position in the sequence is added to the embedding.
- Multi-Head Self-Attention: The model calculates attention scores between this token and all other tokens in the input to create a context-aware representation.
- Feed-Forward Network: Further processing occurs via a fully connected network to refine the representation.
- The deep, encoded representation is passed from the Encoder to the Decoder.
- The Decoder receives a start-of-sequence token and then iteratively generates the output sequence until it reaches the max output token limit or produces a stop token (see the sketch below).
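A minimal end-to-end sketch of this journey using the Hugging Face transformers library, with t5-small as an assumed stand-in encoder-decoder model (tokenization, embeddings, positional information, attention and the feed-forward layers all happen inside the model call):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Tokenization: text -> token IDs
inputs = tokenizer("translate English to German: The weather is nice today.",
                   return_tensors="pt")

# The encoder builds a deep representation of the input; the decoder starts
# from a start token and generates iteratively until a stop token or the limit.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```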
Why is Positional Encoding necessary in Transformers?
Because the core Self-Attention mechanism doesn’t inherently consider the order of words. Positional Encodings inject this crucial sequence information into the token representations.
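A minimal NumPy sketch of the sinusoidal positional encodings from the original Transformer paper (assumes an even model dimension; learned positional embeddings are a common alternative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)   # (4, 8): one positional vector per token, added to its embedding
```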
How does the Decoder use information from both the input (via Encoder) and its own previously generated output?
It uses Masked Self-Attention to consider the previously generated tokens and Cross-Attention to incorporate the contextual information from the Encoder’s final output representations. This combination allows it to predict the next most likely token based on both the input prompt and the output generated so far.
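A minimal NumPy sketch of the causal (look-ahead) mask used in Masked Self-Attention, for a toy 4-token sequence (the -inf convention is the usual way the mask is added to the attention scores before the softmax):

```python
import numpy as np

seq_len = 4
# Position i may not attend to any position j > i
causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
print(causal_mask)
# Adding this mask to the attention scores before the softmax zeroes out the
# weights on future tokens, so each generated token depends only on the
# encoder output (via Cross-Attention) and previously generated tokens.
```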
What is Tokenization, and why is the choice of strategy important for an LLM?
Tokenization is breaking input text into smaller units (tokens) the model processes (e.g., words, sub-words). The strategy impacts the vocabulary size, handling of rare words, and ultimately, model performance and efficiency.
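A minimal sketch of sub-word tokenization, assuming the Hugging Face tokenizer for a GPT-2 checkpoint (the exact splits depend on the tokenizer’s learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["hello", "tokenization"]:
    print(text, "->", tokenizer.tokenize(text))
# Common words tend to map to a single token, while rarer words are broken
# into several sub-word pieces, which affects vocabulary size and cost.
```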
Why are Embeddings a cornerstone of how LLMs process language?
Embeddings convert discrete tokens into dense numerical vectors where semantic relationships are captured geometrically (similar words have closer vectors). This allows the model to perform mathematical operations that reflect linguistic meaning.
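A minimal sketch of that geometric intuition, assuming the sentence-transformers library with the all-MiniLM-L6-v2 checkpoint as an example embedding model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["cat", "kitten", "spreadsheet"])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))   # related words -> higher similarity
print(cosine(vectors[0], vectors[2]))   # unrelated words -> lower similarity
```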
In the GenAI Project Lifecycle, why is “Define the use case” the critical first step for a data scientist?
Clearly defining the use case guides all subsequent decisions: selecting the right foundation model (general vs. specialized), choosing the adaptation strategy (prompting vs. fine-tuning), and defining relevant evaluation metrics.
What are the three primary methods for adapting a base LLM for a specific task, post-selection?
- Prompt Engineering: Crafting effective prompts, potentially with examples (few-shot).
- Fine-tuning: Further training the model on a dataset specific to the target task.
- Aligning with Human Feedback: Using techniques like Reinforcement Learning from Human Feedback (RLHF) to steer the model towards desired behaviors (e.g., helpfulness, harmlessness). Evaluation follows adaptation.
When should a data scientist consider fine-tuning an LLM versus relying solely on prompt engineering?
Consider fine-tuning when:
a) Prompt engineering (even few-shot) doesn’t yield sufficient performance.
b) The task requires deep domain-specific knowledge adaptation.
c) You have a suitable dataset and computational resources for training. Prompting is generally faster and less resource-intensive for simpler adaptations.
What’s a key trade-off when choosing between a very large foundation model and a smaller, potentially domain-specific one?
Large models offer broad knowledge and strong zero/few-shot capabilities but are computationally expensive. Smaller models can be more efficient and potentially achieve higher performance on specific tasks (especially after fine-tuning) but lack broad general world knowledge.
As a data scientist, when is few-shot prompting likely more effective than zero-shot?
Use few-shot when the task requires specific formatting, complex reasoning, or is nuanced. Providing examples guides the model’s output more precisely than instructions alone, especially beneficial for less capable models.
What does “In-context learning” (ICL) refer to in LLMs?
The model’s ability to learn how to perform a task based only on the examples provided within the prompt itself (zero-shot, single-shot, few-shot), without updating its internal weights.
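A minimal sketch of a few-shot prompt for in-context learning (the reviews and labels are invented for illustration; no weights are updated, the model only conditions on the prompt text):

```python
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: The setup was effortless and it runs quietly.
Sentiment:"""
# Sent to an LLM, this prompt typically completes with "Positive":
# the model infers the task from the in-context examples alone.
print(few_shot_prompt)
```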
How can a data scientist adjust an LLM’s output to be more deterministic/focused versus more creative/diverse?
Top K: Samples only from the K most probable tokens. A higher K allows more diverse/creative (and potentially less coherent) output.
Top P: Samples from the smallest set of tokens whose cumulative probability exceeds P. Adapts better to the shape of the probability distribution, often preferred for balancing quality and diversity.
Temperature: T=0 always gives the same (greedy) output; T > 1 flattens the distribution beyond the default, increasing creativity and randomness.
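A minimal NumPy sketch of Top K and Top P (nucleus) filtering over a hypothetical next-token distribution (real decoders apply the same idea to the full vocabulary):

```python
import numpy as np

def sample_token(probs, top_k=None, top_p=None, rng=np.random.default_rng(0)):
    """Sample a token index after optional Top K / Top P (nucleus) filtering."""
    order = np.argsort(probs)[::-1]               # token indices, most probable first
    sorted_probs = probs[order]
    keep = np.ones_like(sorted_probs, dtype=bool)
    if top_k is not None:
        keep[top_k:] = False                      # keep only the K most probable tokens
    if top_p is not None:
        before = np.cumsum(sorted_probs) - sorted_probs
        keep &= before < top_p                    # smallest set whose mass reaches P
    filtered = np.where(keep, sorted_probs, 0.0)
    filtered /= filtered.sum()                    # renormalise over the kept tokens
    return int(order[rng.choice(len(filtered), p=filtered)])

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # hypothetical token probabilities
print(sample_token(probs, top_k=2))               # only the two most likely tokens
print(sample_token(probs, top_p=0.9))             # nucleus: tokens covering 90% of the mass
```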
What are the potential consequences of setting the ‘Max Output Tokens’ parameter incorrectly?
Too low: The model’s response may be cut off mid-thought, incomplete, or unable to fulfill the prompt’s requirements. Too high: Can lead to unnecessary computation or overly verbose/rambling outputs if the model doesn’t naturally conclude sooner.
What does a Temperature value of T=0 effectively correspond to in generation?
It makes the Softmax output extremely sharp, essentially forcing the model to always pick the single token with the absolute highest probability, equivalent to Greedy sampling.
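A minimal sketch of that greedy behaviour over a hypothetical next-token distribution (it complements the temperature sketch earlier in this deck):

```python
import numpy as np

probs = np.array([0.1, 0.6, 0.3])   # hypothetical next-token probabilities
print(int(np.argmax(probs)))        # T=0 / Greedy: always index 1, the most probable token
```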
What do Scaling Laws tell us? Which paper demonstrated this?
They show model performance improves with increases in dataset size, model parameters, or compute, following a power-law relationship.
OpenAI’s “Scaling Laws for Neural Language Models”.
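A minimal sketch of the power-law form for loss versus model size from that paper; the constants below are the paper’s approximate fitted values and should be treated as illustrative rather than authoritative:

```python
def loss_vs_parameters(n_params, n_c=8.8e13, alpha_n=0.076):
    """L(N) = (N_c / N) ** alpha_N, the fit reported by Kaplan et al. (2020)."""
    return (n_c / float(n_params)) ** alpha_n

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e} -> predicted loss ~= {loss_vs_parameters(n):.2f}")
```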