ML and Gen AI Refresh Flashcards

1
Q

What kind of DB does RAG use?

A

Vector database

2
Q

What is a vector database?

A

It is a database that is designed to index, store, and query data in a vector format (e.g. an n-dimensional vector embedding).

3
Q

How does a vector database work?

A

It works by querying for the k vectors closest to a given query vector under a distance metric such as cosine similarity or dot product. Instead of exact k-nearest-neighbor (KNN) search, we typically use approximate nearest neighbors (ANN). This diminishes recall (it may drop some documents that are in fact similar, i.e. false negatives) but is far more performant computationally.
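
A minimal sketch of the idea using brute-force cosine similarity; in practice a vector database would use an ANN index (e.g. HNSW or IVF) rather than scanning every vector, and the array shapes here are illustrative:

import numpy as np

def top_k_cosine(query, vectors, k=5):
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                   # similarity of the query to every stored vector
    return np.argsort(-scores)[:k]   # indices of the k most similar vectors

store = np.random.rand(1000, 384)    # 1,000 stored 384-dimensional embeddings
query = np.random.rand(384)
print(top_k_cosine(query, store, k=3))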

4
Q

Why would I want to optimize recall?

A

We optimize recall if we do not want many false negatives, e.g. telling someone they don’t have an STD when in fact they do.

5
Q

Why would I want to optimize precision?

A

We optimize precision if we want to ensure that we don’t get a lot of false positives, e.g. telling someone they have cancer when they don’t.
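
For reference, the standard definitions in terms of true/false positives (TP, FP) and negatives (TN, FN):

Precision = TP / (TP + FP)   (of everything we flagged positive, how much was truly positive)
Recall = TP / (TP + FN)      (of everything truly positive, how much did we catch)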

6
Q

What are some issues with vector queries

A

There are not many great algorithms for efficient KNN queries that guarantee finding the exact k nearest neighbors to a given vector. Hence we typically opt for ANN, which gives up some accuracy but is efficient.

7
Q

What is zero-shot prompting?

A

Zero-shot prompting can be thought of as asking someone to solve a problem with no context and hoping they get the right answer.

8
Q

What is few-shot prompting?

A

Few-shot prompting can be thought of as adding some context (examples) to help the person solve the problem.

9
Q

What are some limitations of few-shot prompting?

A

They’re not great at dealing with complex reasoning tasks. In those cases we need a more structured approach to the response, such as Chain-of-Thought (CoT) prompting.

10
Q

What is CoT Prompting?

A

CoT prompting is a Q-A-Q-A style technique that aims to get to the right answer by breaking the reasoning out into steps. For example, you could ask a simple math problem, get the answer back, and then ask a follow-up question that builds on the first; this tends to produce a more correct response than asking the harder question upfront.

11
Q

How does CoT come into play with zero-shot?

A

You can combine zero-shot prompting with CoT by simply adding “Let’s think step by step” to the prompt.

12
Q

What are chains in LangChain?

A

They are a sequence of calls, whether to an LLM, tool, or a data preprocessing step.

13
Q

Why do we use chains?

A

Chains allow you to go beyond just a single API call to a language model and instead chain together multiple calls in a logical sequence.

14
Q

Give me an example of the input for chains.

A

A prompt and a model (LLM).

15
Q

Give me an example of the input for chain.run

A

query and text, where query is the base prompt and text is what will be passed into the chain (see the sketch below).
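
A minimal sketch using the legacy LLMChain API; imports and class names vary across LangChain versions, so treat these as illustrative assumptions:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["query", "text"],
    template="{query}\n\n{text}",
)
llm = OpenAI(temperature=0)               # the model
chain = LLMChain(prompt=prompt, llm=llm)  # the chain wires prompt -> model

# chain.run supplies values for the prompt's input variables
result = chain.run(query="Summarize the following text:", text="Chains let you sequence LLM calls...")
print(result)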

16
Q

What is the main architecture powering Foundational Models

A

Transformer Architecture. Essentially it just provides the ability to perform parallel training of gigantic neural networks with billions of parameters.

17
Q

What is an encoder-only architecture?

A

BERT is an example of this. It contains only the encoder piece and transforms the text into its vector representation.

18
Q

Explain to me the transformer architecture

A

Reference: https://blue-season.github.io/transformer-in-5-minutes/

19
Q

What is a decoder-only architecture?

A

GPT-3 is an example. It contains only the decoder. Decoder-only models extend the input text sequence by generating continuations; they are used for text completion and generation.

20
Q

What is an encoder-decoder architecture?

A

It contains both. The decoder consumes the encoded embeddings to generate output text. This can be used for text-to-text tasks, e.g. translation.

21
Q

What makes Foundational Models different?

A

Scale, architecture, pretraining, customization, versatility, infrastructure.

22
Q

What are some different types of FMs

A

Language, Computer Vision, Generative model, Multimodal

23
Q

Walk me through a basic RAG architecture

A

Let’s break out what is in RAG. At an extremely high level, RAG consists of four important things:
1. The question (the user)
2. The external knowledge database (the library of current and/or relevant knowledge)
3. The retriever (the librarian tasked with getting documents related to the question and returning them to better help answer it)
4. A really smart, but also out-of-date or out-of-touch model (the LLM)

So here’s what happens (sketched in code below):
1. The librarian takes the user’s question and finds documents in the library that are relevant to that question.
2. Those documents are then added to the user’s question to help the LLM answer it way more correctly.
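
A minimal sketch of that flow; embed, vector_db, and llm are hypothetical stand-ins for whatever embedding model, vector database client, and LLM you actually use:

def answer_with_rag(question, embed, vector_db, llm, k=3):
    # 1. The "librarian": embed the question and retrieve the k most similar documents.
    query_vector = embed(question)
    docs = vector_db.search(query_vector, top_k=k)

    # 2. Augment the prompt with the retrieved context before calling the LLM.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)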

24
Q

What should all RAGs integrate? NNIC

A

Noise Robustness + Counterfactual Robustness
* Ability to handle noisy or irrelevant data contained in the retrieved documents
* Ability to say: hey, these documents I pulled are totally irrelevant to the question the user is asking. E.g. I want to know how to make hot chocolate, and here are some documents about how to bake a cake :(.

Negative Rejection
* Reject the answer if there is insufficient knowledge (i.e. the LLM gave a poor answer and/or our database returned poor documents because it has nothing related to, say, a very nuanced question on how to formalize a university-grade class on underwater basket weaving).

Information Integration
* Ability to integrate information from multiple sources to answer more complex questions. E.g. think of our library not being limited to just English, but covering science, music, and dare I say even information about yourself!

25
Q

What are some quality scores that should be used to assess our RAG?

A

Happy that you asked. Think of it like a little calf…or CAF.

I. Context Relevance
* The retrieved context NEEDS to be relevant for answering the user’s question.
II. Answer Relevance
* The answer has to directly answer the user’s question. I.e. no “I want to bake a cake” going in and “Top 10 hot sexy things to do in Austin this Tuesday” coming out.
III. Faithfulness
* The answer must be faithful to the retrieved context. I.e. we ask which planet has the most moons, we provide literally the exact answer in the context, and BOOM we still get the wrong answer. Wamp wamp.

26
Q

What are the two primary MUST haves of a RAG

A

Good retrieval makes good answers.
Retrieval: the retriever must be able to find the most relevant documents for answering the user’s question.
Generation: the answer generator must be able to make good use of those documents.

27
Q

Name some ways we can address the retrieval issue with RAGs?

A
  1. Chunk Size Optimization
    Chunking too small or too large may result in inaccurate answers.
  2. Structured Knowledge
    Enables recursive retrievals and query routing.
  3. Sliding Window Chunking
    Overlapping chunks help preserve context across boundaries in long documents.
  4. Metadata Attachments
    Enables more efficient search via filtering, e.g. on keywords!

28
Q

What is chunking in RAGs?

A

Chunking in Large Language Model (LLM) applications breaks down extensive texts into smaller, manageable segments.

29
Q

What is Sliding Window Chunking as it relates to RAGs?

A

In this method, chunks have some overlap, ensuring that information at the edges of chunks is not lost. Sliding window chunking provides a balance between fixed-length and context-aware techniques.
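
A minimal sketch of sliding-window chunking over a raw text string; the chunk size and overlap values are illustrative:

def sliding_window_chunks(text, chunk_size=500, overlap=100):
    # Step forward by (chunk_size - overlap) so consecutive chunks share `overlap` characters.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = sliding_window_chunks("lorem ipsum " * 200)   # toy long document
print(len(chunks), chunks[0][-20:], chunks[1][:20])    # neighboring chunks overlap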

30
Q

Name some ways a RAG can address the good answer generation issue!

A

Information Compression
* Reduces noise and helps alleviate the context-window constraint

Generator Fine-Tuning
* Fine-tune the LLM to help ensure the retrieved docs are aligned with the LLM

Result Re-Ranking
* The process of reordering an initial list of retrieved documents or passages to improve ranking quality
* Alleviates the lost-in-the-middle phenomenon in LLMs

31
Q

What is Knowledge Distillation composed of?

A

The Teacher Network: the big, power-hungry sensei that contains all this vast knowledge.

The Student Network: the eager young grasshopper, i.e. the faster, lighter-weight model we are going to train.

32
Q

What is Lost in the Middle problem for LLMs

A

It’s where LLMs really underperform on certain tasks when the relevant information sits in the middle of the prompt. The more we expand the size of the prompt, the more the information in the middle gets lost.

Just like humans, LLMs respond well to information at the beginning or end of a piece of content; information in the middle tends to get lost.

33
Q

What is the context window?

A

The context window is the number of tokens the LLM can process at once in the input prompt.

34
Q

Why do we use logistic regression?

A

Logistic regression is used if we want to predict something discrete, like whether people like Troll 2.

It fits a squiggle (the S-shaped logistic curve) to the data, which gives the predicted probability of the discrete variable on the y-axis.
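
That squiggle is the logistic (sigmoid) function applied to a linear combination of the predictors, which keeps the output between 0 and 1:

p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x))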

35
Q

What are the assumptions of linear regression? LIHN

A

Linearity: The relationship between the predictors and the response is linear.
Independence: All observations are independent of each other, e.g. the error for one observation does not influence the error for another.
Homoscedasticity: The spread of the residuals is constant across all levels of the predictors, i.e. the variance of our error terms is similar across the independent variables.
Normality of Residuals: The residuals should follow a normal distribution; they should form a bell curve when you plot them.

36
Q

How would you explain the Bias v. Variance Tradeoff to a high school student!

A

OK KIDS! Let’s talk about what the Bias vs. Variance Trade-Off is. To understand this we have to understand what Bias and Variance even are in relation to a model, so let’s hop in!

Bias: Bias is actually exactly what it sounds like, BIAS! For example, let’s say we are using a model and we make some “assumptions” about the data. Let’s say, in our case, we assume everyone at every weight is going to be a height of 3 inches. We absolutely 100% REFUSE to believe otherwise; well, that’s BIAS! Now, in statistics we say that a model that is overly biased is underfitting the data. And when we think of something “fitting” the data we think of how well our model, or in this case a line, matches up with the points, with all points lying on the line being a perfect fit. When we draw the line where every x value = a height of 3 inches, we are essentially drawing a completely horizontal line and missing all the data points! This is called underfitting.

Variance: Now variance is exactly related to what it sounds like, how much things vary! Imagine the model is now so worried about being too biased that it wants to ensure every possible solution and minuscule consideration is taken into account. Therefore, for all weights down to the ounce level we try to make sure that we get the exact possible height; say, for example, in the real world the average height is 3 inches for people between weights of 1-2 ounces. When we have a lot of variance, we’re adding in all these possible heights for a ton of different values between 1-2 ounces. What happens is we start to lose sight of the “trend” in the answer, get too focused on the “exact” answer, and get lost in the weeds. A way to see this is to imagine drawing a squiggly line through all the data points. What this does is add a ton of COMPLEXITY into our model, meaning just that: it’s super complex and gets lost in the weeds of the overall answer. When we have a highly variant or overly complex model we worry about the model overfitting the data, meaning that when the model sees new data it may not be able to get the correct answer because it was soooo focused on the data it saw before.

So this is all great, but what is the Bias vs. Variance Tradeoff?? Well, you actually already have the pieces to know what it is! The Bias vs. Variance tradeoff is ensuring that the model we make is not too biased (leading to underfitting the data) and does not have too much variance (overfitting the data), so that we can find a sweet spot: a model that has reasonable assumptions about the data but isn’t so hyperfocused on being “overly” correct that it loses sight of the broader trend at large. You can think of it as bias being so concerned with one particular right answer that it’s wrong, and variance being so obsessed with all possible right answers that it’s wrong.

37
Q

What is Regularization?

A

Think of regularization like a spanker ready to punish your model when it does something as horrible as overfitting the data (gasp, bad model)!

Regularization spanks your model into shape by adding a penalty on the size of your model’s coefficients. There are a few flavors (L1, L2, Elastic Net, etc.) and each has its own way of SPANKING, or well, shrinking the coefficients towards zero to simplify your overly complex complicado model that loves to overfit.
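
Roughly, the penalized objectives look like this (lambda controls how hard the spanking is; beta_j are the model coefficients):

L2 (Ridge): Loss = sum of squared errors + lambda * sum(beta_j^2)
L1 (Lasso): Loss = sum of squared errors + lambda * sum(|beta_j|)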

38
Q

When should I use regularization

A

Use regularization when you think your model is being naughty and overfitting the training data, i.e. capturing noise instead of the underlying pattern (bad model!).

39
Q

What is Cross Validation

A

Cross-validation is a technique to assess how the results of a statistical analysis will generalize to unseen data. It is used to estimate the skill of a model on new data, tune model parameters, and choose between models.

40
Q

When should you use cross-validation.

A

To evaluate the performance of your model in a more robust way than simply splitting the data into a single train and test set and praying to the math gods that it was good enough. What makes cross-validation sexy is that it allows for multiple rounds of training and validation on different subsets (folds) of the data.
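
A minimal sketch with scikit-learn's cross-validation helper; the model and toy dataset are placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: 5 rounds of training/validation on different splits of the data.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())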

41
Q

When would you use regularization vs cross-validation?

A

Regularization - To spank that model into shape when it’s being a bad boi and overfitting.

Cross-Validation - To assess how well your regularization or other model choices are likely to perform on unseen data, with the added sexy feature of iteratively splitting the data into multiple folds for training and validation.

42
Q

Which model would be better to predict booking prices on Airbnb: linear regression or random forest regression?

A

OK, first of all let’s be real here: linear regression is basic, and what’s the point of this question anyway? It’s to see if you know that one model provides something better and sexier than the other.

Now let’s think about what might go into Airbnb booking prices… (location, amenities, reviews) oh my god, it’s so many my butt hurts. If we approach the problem with linear regression, be my guest, but it’s going to assume that the relationship is… well, linear. And guess what? Some of the factors (features) we are looking at are not going to behave linearly.

Then there’s random forest (oooooo). It’s like a swiss army knife, capable of capturing linear and NON-linear interactions between features without you specifying them. So considering the complexity, Random Forest Regression is probably better.

OMG THAT WAS TOO MUCH INFO GIVE ME A TWO SENTENCE ANSWER.

I would use Random Forest Regression because odds are the relationships between the variables are non-linear, and it can even capture complex interactions between predictors.

43
Q

What is gradient descent?

A

Gradient descent is an iterative optimization algorithm: it minimizes a loss function by repeatedly nudging the model’s parameters in the direction of the negative gradient (i.e. downhill), with the step size controlled by a learning rate, until the loss stops improving.
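
A minimal sketch, assuming a simple one-dimensional quadratic loss just to show the update rule:

# Minimize f(x) = (x - 3)^2; its gradient is 2 * (x - 3).
x = 0.0                 # initial guess
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (x - 3)
    x -= learning_rate * grad   # step in the direction of the negative gradient
print(x)                        # converges toward 3
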
44
Q

How does knowledge distillation work?

A
  1. Knowledge Extraction: Think of the student capturing not just the teacher’s final outputs, but also the intermediate activations and hidden representations from the teacher.
  2. The student learns two primary things: the original data the teacher was trained on AND the knowledge extracted from the teacher’s intermediate activations and hidden representations (a loss sketch follows below).
  3. Graduation: The student learns the essence of the teacher’s knowledge. It may not be a smooth, all-knowing samurai yet, but it’s cheaper, faster, and simpler.
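
A minimal sketch of a distillation loss, assuming PyTorch; the temperature and weighting values are illustrative:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the student also learns from the original labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
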
45
Q

In Knowledge Distillation what are the intermediate activations and hidden representations the student learns from the teacher who detects say images?

A

Intermediate Activations = The outputs of hidden layers. One layer might detect edges, another might detect shapes, and so on. This helps the student break down meaningful features from the image, just like the teacher did.

Hidden Representations: These are built from the intermediate activations within the teacher network, like the teacher’s unspoken thoughts and insights about the image. The student will never match these completely, but by mimicking the teacher’s internal activations it learns similar representations.

46
Q

What are internal activations in a NN

A

The tiny decision makers: e.g. think of one layer detecting edges, the next detecting shapes.
They are the outputs of individual neurons within the hidden layers of the network, representing their activation level. In dry terms: a neuron’s activation indicates its response to the weighted sum of its inputs plus any added bias.

47
Q

What are hidden representations?

A

Whereas internal activations come from individual neurons, think of hidden representations as a summary of the relationships between the individual neuron activations across an entire layer.

48
Q

What’s the difference between hidden representations and internal activations

A

Scope: Internal activations are specific to individual neurons, while hidden representations are collective understandings of an entire layer.
Level of abstraction: Internal activations are raw outputs, while hidden representations are more abstract interpretations.
Accessibility: Internal activations are temporary and not stored, while hidden representations can be extracted and analyzed in some tasks.

49
Q

Does knowledge distillation mean smaller embeddings?

A

Nope, not always! Distillation can also lead to a loss of performance, so it’s important to assess the efficiency vs. performance trade-off.

50
Q

How do LLM like chat gpt work on a high level in one sentence

A

N-to-one: they take n tokens as input and produce one token as output.

51
Q

Difference between sequential models and how they process words and Transformers model?

A

Sequential = processes word by word, in order (oh, right).
Transformers = are like an attentive, high-emotional-IQ boyfriend who has the superpower of considering all your words simultaneously, and paying ATTENTION to the ones that matter most for the current word being processed (awww!)

52
Q

What is comprised in a transformer architecture?

A
  1. Input and Embeddings: The sentence gets transformed into a vector representation, representing the meaning of each word.
  2. Positional Encoding: Words have an order, but the transformer architecture does not know that yet, so we add positional encodings that tell the model the position of each word in the sentence.
  3. Attentioooonnn!!!!: Self-attention is used, where each word pays attention to all other words in the sentence. Each word has a spotlight, and that spotlight shines brightest on the most relevant words for understanding it. = Helps the model capture contextual meaning (see the formula after this list).
  4. Multi-head Attention: Obviously one spotlight is not enough, so transformers have multi-head attention, like different spotlights with different filters, where each captures different aspects of the relationships between words. = Helps the model learn multiple levels of representation.
  5. Encoder and Decoder (if applicable): Depending on the task we might have an encoder that processes the input sentence using these attention layers, followed by a decoder that generates the output using attention to both the encoder’s output and its own previously generated words.
  6. Output: Finally, this leads to the desired output gathered through the attention layers, e.g. a translation, summary, etc.
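
The scaled dot-product attention used in step 3, where Q, K, and V are the query, key, and value projections of the word embeddings and d_k is the key dimension:

Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V
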
53
Q

Name some benefits of Transformers compared to Sequential Models

A

Short answer:
1. Faster than sequential: due to PP parallel processing of words
2. Long-Range Dependencies: Captures relationships between words far apart in the sentence.
3. Better contextual awareness.
4. The architecture is more flexible for different tasks.

THE PP (Parallel Processing): Allows for parallel processing of all words, compared to sequential models where it’s word by word—e.g. makes ’em fast bois.

Long-Range Dependencies - Capture relationships between words FAR apart in the sentence, something traditional models struggle with. E.g. “Although the farmer lives alone, he was never lonely because of his dog.” The long-range dependencies between “alone”, “because”, and “never lonely” offer the hints.

Flexible for a lot of tasks: The architecture is easy to adapt for tasks like Q&A, sentiment, entity recognition, etc.

53
Q

What is a token in LLM context?

A

They’re words or chunks of characters, like “ham”, “bur”, etc. A rule of thumb is 1 token ≈ 4 characters, or 1 token ≈ 0.75 words, for English text.
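
A minimal sketch of counting tokens, assuming OpenAI's tiktoken library is available; the encoding name is illustrative:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I had a hamburger for breakfast.")
print(len(tokens), tokens)   # token count and the token IDs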

53
Q

Why did we switch to Transformers and not just LSTM based architectures?

A

Primarily, Transformers are faster AND better at understanding long-range dependencies. Imagine a detective struggling to connect a hidden clue at the beginning of a case to a seemingly unrelated detail at the end. Transformers, with their global attention, can easily spot these distant connections.

54
Q

Can logistic regression predict a continuous variable?

A

No. Logistic regression is designed for classification; it squashes the output into a probability between 0 and 1, which is then thresholded into discrete classes.

55
Q

Can a logistic regression use a continuous var as a predictor

A

Yes

56
Q

What is the difference between self attention and attention in a transformers architecture?

A
57
Q

Explain an LLM to a more technical person

A
58
Q

Explain an LLM to a non-technical person

A

LLMs are extremely literal robotic improv artists who grew up on the internet. Hear me out.

Imagine you’re at an improv show with a troupe composed of two members:

A human
An LLM

  1. The user prompt
    The improv artist asks for a word of suggestion, or a couple of words. You, the audience, the user, decide to say “APPLES!”
    or maybe you even decide to say “Apples!!! I have one for breakfast every day.” or even “APPLES! OH MY GOD I CANNOT LIVE MY LIFE WITHOUT APPLES EVERY DAY I WILL DIE.”
  2. The tokens
    Now, what you just gave the human on stage are words, and what you gave the LLM are tokens. Longer suggestions/context have more words and therefore more tokens, and shorter responses have fewer tokens. Bonus: in the English language, about 4 characters equate to one token. So let’s just stick with the improv artist and the LLM.
  3. What the LLM and the Human do with the tokens: Word Association Game

Now let’s say, for the sake of this example, the improv artist goes: OK, we have our word of suggestion when the lights come on, “APPLES”. Which just means you’re about to watch a show about… welp, apples. BUT HOW DO IMPROV ARTISTS DO THIS???

For the human: They’re going to create a show based off of word associations to Apples.

For the LLM: They’re going to create a show based off of word associations to Apples.

….
Just how they got there in their associations is a bit different, yet similar (and some people like to argue this… but that’s not a discussion for now).

The human is building word associations that they have learned throughout their life in that particular language. They may be using personal experience, stories they’ve heard from yesteryear–whatever, but they aren’t making up stories about whales–THE SHOW IS ABOUT APPLES.

The LLM is building word associations that it also learned throughout its life (its training phase). It learned these probabilities from things like words it “saw” on the internet and the patterns it gathered. From those patterns it predicts a probability for each possible next word related to apples, and picks the most likely one to do its “bit”. This keeps going, with the LLM using its own predictions as prompts for the next word, until it reaches a final length.
This is the n-tokens-in, one-token-out process repeated in what’s called an expanding window pattern.

  4. Writing good prompts and ensuring a fun show: the beauty of context

NOW, I know you didn’t ask for this, but in case you were curious about how to write better prompts—context is key here (actually for both human and LLM improv artists, BUT ESPECIALLY THE LLM).

You see, if the improv troupe just goes with the word “apples”, you might end up having a not-so-fun show. Why?

For the improv artist: They’re now forced to come up with a show about apples, and while they can deviate from that, it doesn’t give them as much material to work with as, say, the unhinged response which mentioned you would die without apples. One big difference between the LLM and the human here, though, is this: the human still knows they’re doing a show, because you and I as humans can pick up on unsaid context.

And for the LLM? Well… they don’t know they’re at an improv show with just the word “Apples”.

If we give the LLM just the word “Apples”, they’re going to create a bit about… well, something that ultimately will sound like an encyclopedia reference to apples, the kinds of them, etc. It won’t be very funny.

So next time you’re at an improv show or writing a prompt for an LLM, make sure you’re adding context to experience a better show.

59
Q

What’s the difference between self-attention and attention layers in the transformer architecture?

A

Self-attention and Attention are both mechanisms that allow transformer models to attend to different parts of the input or output sequences when making predictions.

Attention refers to the ability of a transformer model to attend to different parts of related sequences when making predictions. This is often used in encoder-decoder architectures, where the encoder vectorizes the input sequence, and the decoder attends to the encoded representation of the whole input when making predictions. For example, in a language translation, attention models the relationship between the original and translated text.

Self-attention, on the other hand, refers to the ability of a transformer model to attend to different parts of the input sequence when making predictions. The name comes from the fact that contrary to “regular” attention, self-attention refers to the same sequence which is currently being encoded. This allows us to look at the whole context of our sequence while encoding each of the input elements.

With the superpower of Attention, our robot can listen to one person speaking in English, translate it into French, and then make sure the translation makes sense by checking back with the entire conversation. It’s like having a conversation between two languages and making sure nothing gets lost in translation.

Now, switch gears to Self-Attention. This is when our robot focuses on just one person’s story, understanding every detail by considering the whole story as it listens. It’s like the robot is making a map of the story, seeing how each word connects to the others, making sure it gets the full picture.

So, in the world of transformers, Attention helps our robot juggle between two languages or parts of a conversation, while Self-Attention helps it deep dive into one story to catch every nuance. Bam! That’s how our transformer robot keeps up with the chatter, making sure it understands everything, whether it’s translating or just listening in.

60
Q

What are some common alignment problems in LLMs?

A

The alignment problem in Large Language Models typically manifests as:

Lack of helpfulness: when the model is not following the user’s explicit instructions.
Hallucinations: when the model is making up nonexistent or wrong facts.
Lack of interpretability: when it is difficult for humans to understand how the model arrived at a particular decision or prediction.
Generating biased or toxic output: when a language model that is trained on biased/toxic data may reproduce that in its output, even if it was not explicitly instructed to do so.

61
Q

What is NER

A

NER is also called entity identification, entity chunking or entity extraction.

NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, monetary values, etc.

62
Q

What are some benefits of using TF-IDF instead of TF?

A

When plain TF is used, the raw frequency of words appearing in the corpus is used, which is mostly dominated by stop words such as “the”, “a”, “and”, etc. So most of the output will be skewed towards the stop words.
When TF-IDF is used, the numerical output is not dominated by the most common words in the corpus: the weights of the common words are scaled down, and the weights of the uncommon words are scaled up.
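
A minimal sketch contrasting the two with scikit-learn; the toy corpus is illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

tf = CountVectorizer().fit_transform(corpus)      # raw term frequencies (stop words dominate)
tfidf = TfidfVectorizer().fit_transform(corpus)   # frequencies reweighted by inverse document frequency

print(tf.toarray())
print(tfidf.toarray().round(2))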

63
Q

How to mitigate bias in LLM.

A

PIPF. Bias mitigation can be broken down into four stages:

Pre-processing Stage: Data is curated and modified to remove or reduce sources of bias, ensure fairness, etc.

In-processing (Fine-tune) Stage: The design or implementation is modified and/or optimized to counteract bias during the LLM learning process, via adversarial learning, regularization, debiasing losses, or fairness constraints that discourage or penalize the LLM from learning biased representations. The in-processing stage helps reduce model bias in LLM outputs, which is the bias that stems from the design or implementation of the LLM.

Post-processing stage: Outputs generated by the LLM are modified or improved to correct or compensate for bias after the generation process, via output filtering, output rewriting, output ranking, etc. that can detect and remove or reduce bias from the LLM outputs. The post-processing stage helps reduce decoding bias in LLM outputs, which is the bias that stems from the algorithm or technique used to generate text from the LLM.

Feedback stage: This is the stage where human interaction or intervention with the LLM is involved to monitor, evaluate, or intervene in the AI development and deployment process. This can involve techniques such as human feedback, human evaluation, human oversight, or human collaboration that can identify and address bias issues in the LLM outputs. The feedback stage helps reduce feedback bias in LLM outputs, which is the bias that stems from human interaction or intervention with the LLM.
HITL (human in the loop) can be used to help mitigate bias:

* Training data curation to ensure the data is diverse
* Model fine-tuning: provide feedback on model outputs
* Customization and control: humans can adjust the output by tailoring it towards specific domains

64
Q

How does bias happen in LLM

A

Three Reasons TAU:
1. Training data
2. Algorithms - For example, if an algorithm places more importance on certain features or data points, it may unintentionally introduce or amplify biases present in the data.
3. Use case - If an LLM is designed to generate content for a certain demographic or industry, it may inadvertently reinforce existing biases and exclude different perspectives.

65
Q

Name strategies to mitigate bias in each stage of the LLM process for preprocess stage

A

Data Augmentation: Introduce additional diverse and balanced examples to counteract biases present in the training data.
Data Filtering: Remove or down-sample data that contains explicit biases or skewed representations.
Synthetic Data Generation: Create synthetic examples that promote fair representations of underrepresented groups, helping the model learn more equitable patterns.

66
Q

Name strategies to mitigate bias in each stage of the LLM process for the in-processing stage

A

In-processing Stage: In-processing strategies involve modifying the training process itself to encourage fairness and reduce bias:
Bias-Aware Loss Functions: Modify the loss function to penalize biased predictions, incentivizing the model to produce more neutral outputs.
Regularization: Apply regularization techniques that discourage the model from learning associations that lead to biased predictions.
Adversarial Training: Train an auxiliary model to identify and counteract bias, encouraging the main model to generate less biased outputs.

67
Q

Name strategies to mitigate bias post processing stage

A

Post-processing Stage: Post-processing strategies involve refining model outputs after they are generated:
Re-ranking: Rank generated outputs based on bias-reduction criteria, promoting less biased responses.
Bias Correction: Identify and replace biased language or associations in generated text using predefined guidelines.
Rewriting: Automatically rewrite biased sentences to be more neutral and inclusive.

68
Q

Name strategies to mitigate bias feedback stage

A

Human-in-the-Loop: Involve human reviewers to review and correct biased outputs during the model’s fine-tuning phase.
Continuous Monitoring: Continuously track and analyze model outputs in real-world applications to identify and rectify any new biases that may emerge.
User Customization: Allow users to customize the model’s behavior in terms of bias reduction, striking a balance between user preferences and ethical considerations.
Diverse Stakeholder Involvement: Collaborate with diverse stakeholders, including ethicists, linguists, and impacted communities, to ensure the de-biasing process aligns with a wide range of perspectives.

69
Q

What is an auxiliary model?

A

As mentioned earlier, an auxiliary model isn’t a separate model itself but a technique used in training other models. Here’s an example to illustrate this technique:

Scenario: Imagine you are training a model to classify handwritten digits (0-9) from images. This is the main task.

Using Auxiliary Tasks:

Color Recognition: As an auxiliary task, you could train the model to predict the dominant color of the image alongside recognizing the digit. This helps the model learn features related to color variations, which might be subtle but still relevant for distinguishing certain digits (e.g., differentiating a pale 2 from a darker 7).

Line Detection: Another auxiliary task could be to have the model identify lines or edges within the image. This helps the model understand the overall structure and shape of the digit, which is crucial for differentiating similar-looking digits (e.g., differentiating a closed 0 from an open 6).

Benefits in this example:

By learning these additional tasks (color and line recognition), the main model (digit classification) gets a better understanding of the data (handwritten images). This can lead to:

Improved accuracy: The model might be able to differentiate similar-looking digits more accurately with the additional information gained from the auxiliary tasks.
Data efficiency: If training data for handwritten digits is limited, the auxiliary tasks can help the model learn better even with less data specifically for digit classification.

70
Q

What is transfer learning?

A

Transfer learning is a machine learning technique where you take a model trained on one task and use it as a starting point for a model on a different but related task. Here’s a breakdown to make it easy to understand:

Think of it like recycling knowledge:

The original model: Imagine you’ve spent a lot of time learning how to ride a bicycle. You’ve gained valuable skills like balance, coordination, and understanding how to navigate.
A new task: Now, let’s say you want to learn how to ride a motorcycle. Transfer learning is like taking the knowledge you already have from bike riding and applying it to help you learn the motorcycle more quickly.

How it works in machine learning:

Pre-trained model: You start with a model that has already been trained on a large dataset for a task (let’s say, recognizing different types of objects in images).
New but related task: You want your model to do something similar, but not exactly the same (let’s say, recognizing specific types of birds).
Adapting the knowledge: Instead of training a brand new model from scratch, you start with the pre-trained model and fine-tune it. You might freeze some layers (keeping the general knowledge) and train new layers to learn the specifics of birds.

71
Q

Diff between auxiliary and transfer learning

A

Key Differences:

Focus:

Transfer learning: Takes knowledge from a previous task and applies it to a new, related task. The focus is on reusing pre-trained knowledge.
Auxiliary models: Focuses on introducing additional, often simpler tasks during the training of a single model to improve its performance on the main task.
Knowledge Source:

Transfer learning: Takes knowledge from an entirely separate, pre-trained model.
Auxiliary models: Learns the additional tasks simultaneously as part of the main model’s training.
Timeline:

Transfer learning: There’s usually a separation between the training of the original model and when you begin applying its knowledge to the new task.
Auxiliary models: The main and auxiliary tasks are trained together within the same model.
Similarities:

Both can improve model performance and data efficiency.
Both involve the idea of learning multiple things to gain a better understanding overall.
To sum up, think of it this way:

Transfer learning is like bringing in an expert consultant with experience in a related field to help you solve a new problem.
Auxiliary models are like providing a student with extra practice problems or study materials along with the main topic to improve their overall understanding.

72
Q

What is a classic example of transfer learning that you have used in the past

A

Using a pretrained model like BERT and adapting it to my task, like labeling sentiment (see the sketch after the steps below).

  • Adding a new Output Layer: You add a layer on top of the pre-trained model designed to predict sentiment (positive, negative, neutral, etc.).
  • Training on Your Data: You feed your labeled sentiment data to adjust the weights in the model, specializing it to the task of understanding sentiment.
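
A minimal sketch with the Hugging Face transformers library; the model name and label count are illustrative, and the training loop/data loading are omitted:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pre-trained BERT body + a new, randomly initialized classification head (the new output layer).
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Optionally freeze the pre-trained body so only the new head is trained at first.
for param in model.bert.parameters():
    param.requires_grad = False

inputs = tokenizer("I loved this movie!", return_tensors="pt")
outputs = model(**inputs)    # logits over the 3 sentiment labels
print(outputs.logits)
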
73
Q

Why use a transformer-based architecture compared to an LSTM-based architecture?

A

LSTMs had an encoder-decoder architecture:

Encoder - creates a vector representation of the words
Decoder - returns a sequence of words from the vector representation

An LSTM needed the output of the previous state as input to any operation on the current state, in order to take into account the interdependence of words.

Transformers maintain the interdependence of words without an RNN by using an attention mechanism. Attention measures how closely two elements of two sequences are related. When applied to a single sequence, it is known as a self-attention layer. This makes transformers much faster to train, whereas LSTMs are sequential and therefore not as fast. Additionally, transformers can have contextual embeddings, which draw information from context to correct missing or noisy data.

74
Q

What is the self attention layer and how does it make it faster to train?

A

The self-attention layer determines the interdependence of different words in the same sequence, in order to associate a relevant representation with each word.

Example:
“The dog didn’t cross the street because it was too tired.” To us, “it” is the dog and not the street. The objective of the self-attention process is to detect the link between “dog” and “it”. This feature makes transformers much faster to train compared to the other models.

75
Q

Can you differentiate between a traditional language model and a Large Language Model (LLM) like me? Explain the key differences in their capabilities.

A

Processing speed: Traditional models, like LSTMs, struggled with parallel processing, leading to slower training and execution.
Contextual awareness: LLMs, with their Transformer architecture and self-attention mechanism, excel at understanding the relationships between words, providing them with superior contextual awareness.
Data volume: LLMs are trained on massive amounts of data, making them more robust and knowledgeable compared to their less data-hungry predecessors.
Generalizability: LLMs, due to their architecture and training methods, are often more adaptable and can perform various NLP tasks, unlike traditional models often limited to specific functions.

Provide data for training Generative AI models: Analyzing vast amounts of text data using NLP techniques helps identify patterns and relationships in language, which are then used to train Generative AI models to produce human-quality text, speech, or code.
Guide and shape the output of Generative AI models: By understanding the context and intent behind user input or prompts, NLP can refine the direction and style of the text generated by AI, ensuring it’s relevant and coherent.
Evaluate the performance of Generative AI models: NLP techniques can be used to assess the quality and effectiveness of the outputs generated by AI models, helping to identify areas for improvement and tailoring them to specific user scenarios.

76
Q
Imagine you’re working with a salesperson on a deal with a customer who wants to use Generative AI to create marketing materials tailored to specific customer segments. What key questions would you ask the customer to understand their needs and propose an appropriate solution?
A
77
Q

STAR

A

Situation: Set the scene
Task: Describe the purpose
Action: Explain what you did
Result: Share the outcome

78
Q

Consideration-Design-Build-Deploy-Demonstrate Value to Customer - Scale

A

Consideration: (Functional and non-functional requirements)
Problem, things to consider, success metrics and KPIs
Design: Model choice, System architecture: Data pipeline, Offline training, Online predictions, Metrics and Monitors, MLOPs
Build: CI/CD best practices, Google Cloud Vertex AI for model development and deployment
Develop APIs to integrate the solution
Implement unit tests:
Deploy:
Testing and monitoring, what do we want to monitor, including infra health, model performance, bias etc.
Demonstrate value to customer:
* Track key metrics over time and/or compare against benchmarks
* Gather feedback etc
Scale:
Horizontal scaling: Add additional computing resources to handle increased load
Model retraining/drift detection:
More advanced algorithms
Auto scaling on online endpoint node

79
Q

Horizontal Scaling in Recommendation System

A

Concept: Add more machines (nodes) to the system to distribute the workload and increase processing power.
Example: As data and the user base grow, performance bottlenecks may occur. Horizontal scaling can be implemented by adding more instances of the model running on separate machines (like autoscaling). This increases the system’s capacity to handle more users and data without compromising responsiveness.

Vertical Scaling:
Concept: Upgrade existing hardware CPU and memory of machine(s) running in the system
Example

Autoscaling is like horizontal scaling:
* Add more machines or resources during peak periods of high traffic; during low periods it can scale down. This dynamic scaling ensures the system can efficiently handle dynamic workloads and be more cost-effective!

80
Q

Sys design questions:

A

Consideration:
Functional requirements, non-functional requirements
Success metrics/KPIs
Design: model choice, services, ci/cd etc
Build: Services used, what we’re actually building
Deploy: What resources we’re using, monitoring, model drift etc
Value: Demonstrate value to customer, customer feedback, benchmarks, kpis etc
Scale: Horizontal, Vertical, Auto

81
Q

Imagine you’re working with a salesperson on a deal with a customer who wants to use Generative AI to create marketing materials tailored to specific customer segments. What key questions would you ask the customer to understand their needs and propose an appropriate solution?

A

Understanding the Objective:

Marketing Goals: What specific goals do they hope to achieve with these tailored marketing materials? (e.g., brand awareness, lead generation, increased engagement)
Target Audience: Who are the specific customer segments they want to reach? (e.g., demographics, interests, pain points)
Success Metrics: How will they measure the effectiveness of the generated marketing materials? (e.g., click-through rates, conversion rates, social media engagement)

Content and Design Preferences:

Desired Formats: What formats do they envision for the marketing materials? (e.g., social media posts, blog articles, email newsletters)
Brand Voice and Tone: What is their desired brand voice and tone for each customer segment? (e.g., formal, informal, humorous, informative)
Existing Assets: Do they have any existing marketing materials that have performed well that could be used as inspiration or training data for the LLM?

Technical Considerations:

Data Availability: Do they have access to relevant customer data or insights for each segment that can be used to inform the LLM?
Compliance Requirements: Are there any specific compliance requirements or regulations they need to consider when generating marketing materials?

Additionally, you might inquire about their budget and timeline for the project.


82
Q

What simple approach can you use for the initial questions?

A

Understand the objective:
Marketing goals, target audience, benchmarks and kpis, timeline, and budget

Content and Design Preferences:
Desired formats: single report or chatbot? Will marketing materials contain images or only text posts, and for which platform? Existing assets?

Technical Considerations: Data availability, Compliance requirements (IP, PII), Performance metrics, ethical considerations

83
Q

What is the attention layer?

A

Words compared to other words.

The attention layer is like whispering with your buddies getting a scoop at a party.

It takes each word and compares it to all the others that seem relevant based on a weight.

By the end, each word has a better understanding of the whole party because it has made important connections with the others.

84
Q

What is the self-attention layer?

A

Words pay attention to the other words to figure out how they relate to each other. I.e. OK, “king” is important, but “jester” is definitely “connected” to “king”.

85
Q

Least-to-Most Prompting

A

HL (high level):
Answer this question to obtain the REAL question I want answered.

Give the LLM a roadmap–e.g. you give it a starter question like:

Prompt 1:
What is the capital of France?

And based on the answer to that, we then ask: what continent is France located on?

86
Q

Chain of Thought Prompting

A

Similar to least-to-most, but instead of just showing the LLM the steps, you also include how to think through each step.

I see a ball, it’s next to the couch, go get it and bring it back!

Prompt:
I need to find the distance between New York City and Los Angeles.
1. First, I need to find the latitude and longitude of each city.
2. Then, I can use the distance formula to calculate the distance between those coordinates.

87
Q

Self-Ask Prompting

A

Give it a starting point, and then it asks itself questions to figure out the answer.

E.g. How do I get that ball? And let the LLM come up with steps on its own.

88
Q

ReAct

A

This prompt combines reasoning + taking action. The LLM can figure things out and then use that knowledge to do something in the real world.

Prompt:
A customer is looking for a new pair of running shoes.
1. Based on their preferences (cushioning, support, etc.), recommend some shoes from our inventory.
2. If the customer needs more information about a specific shoe, search the web for reviews and additional details.

E.g. it has access to other knowledge sources

89
Q

Meta Prompting

A

Next level. You basically teach the LLM how to learn and improve its own prompts. It’s like saying hey if you get stuck, try asking a different question or rephrasing things a bit.

Prompt:
When answering a question, try your best to be informative and comprehensive. If you’re unsure about something, rephrase the question or search for additional information before responding.

90
Q

Symbolic Reasoning and PAL (Program-Aided Language)

A

This helps LLMs understand things that aren’t just numbers, like colors and objects.

I have 3 apples, 2 oranges, and 1 banana. How many fruits do I have in total?
The LLM might convert this prompt to a program like:
fruits = {"apple": 3, "orange": 2, "banana": 1}
total_fruits = sum(fruits.values())
print(total_fruits)

91
Q

Iterative Prompting

A

Basically you keep feeding it information until it has a clear picture of what you actually want based on the output.

You are writing a news article about the discovery of a new planet.
1. Start with a headline that grabs attention.
2. Briefly describe the planet’s location and size.
3. Is there any information about the planet’s atmosphere or potential for life?
4. Include a quote from the scientist who made the discovery.

92
Q

Sequential Prompting

A

This one is good for recommendation systems, like suggesting movies or products. It helps the LLM rank things based on how relevant they are, kind of like saying “hey, focus on the movies I actually like, not the ones I don’t.”

Prompt:
A user has watched movies A, B, and C. Based on their viewing history and movie descriptions, recommend the movie they are most likely to enjoy next (out of choices D, E, and F).

93
Q

Self-Consistency

A

It considers different ways of solving a problem and picks the most logical one.

Prompt:
Solve the following equation: 2x + 5 = 11.
The LLM might also consider alternative solutions and choose the most consistent one.

94
Q

Automatic Reasoning & Tool Use (ART):

A

Lets the LLM use external tools and resources to solve problems, like calling an API to look up a fact.

Prompt:
What is the current weather in London?
The LLM could use an external weather API to retrieve the latest data.

95
Q

Generated Knowledge Prompt

A

This is where the LLM can learn from information you give it at the moment you ask a question.

Prompt:
Write a poem about a cat. Here’s a fun fact: Did you know that cats have excellent night vision?