Practice Questions Flashcards
(57 cards)
A Generative AI Engineer has created a RAG application to look up answers to questions about a series of fantasy novels that are being asked on the author’s web forum. The fantasy novel texts are chunked and embedded into a vector store with metadata (page number, chapter number, book title), retrieved with the user’s query, and provided to an LLM for response generation. The Generative AI Engineer used their intuition to pick the chunking strategy and associated configurations but now wants to more methodically choose the best values.
Which TWO strategies should the Generative AI Engineer take to optimize their chunking strategy and parameters? (Choose two.)
A. Change embedding models and compare performance.
B. Add a classifier for user queries that predicts which book will best contain the answer. Use this to filter retrieval.
C. Choose an appropriate evaluation metric (such as recall or NDCG) and experiment with changes in the chunking strategy, such as splitting chunks by paragraphs or chapters. Choose the strategy that gives the best performance metric.
D. Pass known questions and best answers to an LLM and instruct the LLM to provide the best token count. Use a summary statistic (mean, median, etc.) of the best token counts to choose chunk size.
E. Create an LLM-as-a-judge metric to evaluate how well previous questions are answered by the most appropriate chunk. Optimize the chunking parameters based upon the values of the metric.
The correct answers are C and E.
- C is correct because it advocates for a systematic, metric-driven approach to chunking. By selecting an appropriate evaluation metric (like recall or NDCG) and experimenting with different chunking strategies (paragraphs vs. chapters), the engineer can quantitatively determine the best performing strategy.
- E is correct because using an LLM as a judge provides a direct measure of how well chunks answer questions. This directly assesses the suitability of question and answer pairs, allowing for targeted optimization of chunking parameters.
- A is incorrect because changing embedding models addresses the quality of the vector representations, not the chunking strategy itself. While important, it’s a separate optimization concern.
- B is incorrect because adding a classifier for books focuses on improving retrieval by filtering content, not on optimizing the chunking strategy.
- D is incorrect because relying on an LLM to suggest token counts may be limiting and not as comprehensive of an optimization strategy as actually evaluating different chunk division strategies.
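To make option C concrete, here is a minimal, self-contained sketch of metric-driven chunking evaluation. The corpus, questions, splitters, and bag-of-words scoring are illustrative stand-ins (a real pipeline would use the actual embedding model and vector store); a retrieved chunk counts as a hit if it contains the known gold answer.

```python
# Minimal sketch: compare chunking strategies with a recall-style metric.
# The corpus, questions, splitters, and scoring below are illustrative stand-ins only.
from collections import Counter
import math

def split_by_paragraph(text):
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def split_by_fixed_size(text, size=80):
    return [text[i:i + size] for i in range(0, len(text), size)]

def bow_cosine(a, b):
    """Bag-of-words cosine similarity as a toy stand-in for an embedding model."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def recall_at_k(chunks, eval_set, k=2):
    """Fraction of questions whose known gold answer appears in the top-k retrieved chunks."""
    hits = 0
    for question, gold_answer in eval_set:
        top_k = sorted(chunks, key=lambda c: bow_cosine(question, c), reverse=True)[:k]
        hits += any(gold_answer.lower() in c.lower() for c in top_k)
    return hits / len(eval_set)

corpus = "The dragon Alaric guards the northern pass.\n\nQueen Maretta rules the city of Veldt."
eval_set = [("Who guards the northern pass?", "Alaric"), ("Who rules Veldt?", "Maretta")]

for name, splitter in [("paragraphs", split_by_paragraph), ("fixed-80-chars", split_by_fixed_size)]:
    print(name, recall_at_k(splitter(corpus), eval_set, k=1))
```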
A company has a typical RAG-enabled, customer-facing chatbot on its website.
Select the correct sequence of components a user’s questions will go through before the final output is returned. Use the diagram above for reference.
A. 1. embedding model, 2. vector search, 3. context-augmented prompt, 4. response-generating LLM
B. 1. context-augmented prompt, 2. vector search, 3. embedding model, 4. response-generating LLM
C. 1. response-generating LLM, 2. vector search, 3. context-augmented prompt, 4. embedding model
D. 1. response-generating LLM, 2. context-augmented prompt, 3. vector search, 4. embedding model
A. The correct sequence is embedding model, vector search, context-augmented prompt, and finally response-generating LLM. First, the user’s question needs to be converted into a vector embedding. This embedding is then used to search a vector database for relevant context. The relevant context is combined with the original query to create a context-augmented prompt, which is then fed into the LLM to generate the response.
- Why A is correct: This follows the standard RAG pipeline. The question is embedded, relevant context is retrieved, a prompt is constructed, and a response is generated.
- Why B is incorrect: The embedding model needs to be the first step.
- Why C is incorrect: The LLM needs the context-augmented prompt to generate a relevant response, which is why it comes last in the sequence.
- Why D is incorrect: The LLM needs the context-augmented prompt to generate a relevant response, so it must come last in the sequence, and the embedding model must be the first step.
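As a rough illustration of the order in option A, the sketch below walks a question through the four stages; every function is a hypothetical stub standing in for the real embedding model, vector index, prompt template, and response-generating LLM.

```python
# Schematic RAG request flow (option A). All functions are illustrative stubs,
# not a real embedding model, vector index, or LLM.

def embed(text):                       # 1. embedding model
    return [float(len(w)) for w in text.split()]

def vector_search(query_vector, k=2):  # 2. vector search over a toy "index"
    # A real implementation would rank stored vectors by similarity to query_vector.
    index = {"Chapter 3 covers dragon lore.": [7.0, 1.0, 6.0, 6.0, 5.0]}
    return list(index.keys())[:k]

def build_prompt(question, context_docs):  # 3. context-augmented prompt
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def generate(prompt):                  # 4. response-generating LLM (stubbed)
    return f"[LLM response to a {len(prompt)}-character prompt]"

question = "Where is dragon lore covered?"
print(generate(build_prompt(question, vector_search(embed(question)))))
```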
A Generative AI Engineer interfaces with an LLM with prompt/response behavior that has been trained on customer calls inquiring about product availability. The LLM is designed to output “In Stock” if the product is available or only the term “Out of Stock” if not.
Which prompt will work to allow the engineer to respond to call classification labels correctly?
A.
Respond with “In Stock” if the customer asks for a product.
B.
You will be given a customer call transcript where the customer asks about product availability. The outputs are either “In Stock” or “Out of Stock”. Format the output in JSON, for example: {“call_id”: “123”, “label”: “In Stock”}.
C.
Respond with “Out of Stock” if the customer asks for a product.
D.
You will be given a customer call transcript where the customer inquires about product availability. Respond with “In Stock” if the product is available or “Out of Stock” if not.
D. Option D provides the LLM with the necessary context (customer call transcript about product availability) and clear instructions on how to respond based on product availability, which aligns with the LLM’s design.
- Why D is correct: It directly instructs the LLM to respond with “In Stock” if available and “Out of Stock” if not, covering both possible outcomes.
- Why A is incorrect: It only specifies the response for one scenario (“In Stock”) and doesn’t cover the “Out of Stock” case.
- Why B is incorrect: While it provides a format, it does not specify when to use each output (“In Stock” or “Out of Stock”) and introduces a JSON output format that was not requested.
- Why C is incorrect: It only specifies the response for one scenario (“Out of Stock”) and doesn’t cover the “In Stock” case.
A Generative AI Engineer is testing a simple prompt template in LangChain using the code below, but is getting an error.
Assuming the API key was properly defined, what change does the Generative AI Engineer need to make to fix their chain?
A.
B.
C.
D.
D is the correct answer. The OpenAI() constructor needs to be passed to the LLMChain() constructor as llm=OpenAI(), as sketched below. Options A, B, and C do not pass the OpenAI() constructor to the LLMChain() constructor, leading to an error.
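Since the original code snippet is not reproduced on this card, the following is only a sketch of the working pattern the explanation describes, using the classic LangChain API; import paths vary by LangChain version, and the prompt template here is made up.

```python
# Sketch of the fix described above (classic LangChain API; exact import paths
# depend on the installed LangChain version, and the original snippet is not shown here).
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI  # assumes the langchain-openai package is installed

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Write one sentence about {topic}.",
)

# The fix from option D: the OpenAI() LLM must be passed as llm= to LLMChain().
chain = LLMChain(llm=OpenAI(), prompt=prompt)
print(chain.run(topic="vector databases"))
```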
A Generative AI Engineer is designing an LLM-powered live sports commentary platform. The platform provides real-time updates and LLM-generated analyses for any users who would like to have live summaries, rather than reading a series of potentially outdated news articles.
Which tool below will give the platform access to real-time data for generating game analyses based on the latest game scores?
A. DatabricksIQ
B. Foundation Model APIs
C. Feature Serving
D. AutoML
C. Feature Serving. Feature Serving is designed to provide features to machine learning models in real-time. This allows the platform to ingest live sports data (scores, statistics) and feed it into the LLM for generating up-to-the-minute game analyses.
- Why C is correct: Feature serving is specifically built for providing real-time data to machine learning models.
- Why A is wrong: DatabricksIQ is the AI-powered knowledge engine behind assistant and documentation features in the platform; it does not serve real-time feature data to applications.
- Why B is wrong: Foundation Model APIs provide access to pre-trained LLMs but do not handle live data integration.
- Why D is wrong: AutoML is focused on automating machine learning model development, not real-time data delivery.
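As a rough sketch of how the platform might pull real-time features at generation time, the snippet below calls a hypothetical Databricks feature serving endpoint over REST; the workspace URL, endpoint name, lookup key, and environment variable are all assumptions for illustration.

```python
# Hedged sketch: fetch the latest game scores from a hypothetical Databricks
# feature serving endpoint, then hand them to the LLM as prompt context.
import os
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"   # hypothetical
ENDPOINT_NAME = "live-game-scores"                                # hypothetical

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"dataframe_records": [{"game_id": "2024-finals-g5"}]},  # hypothetical lookup key
    timeout=10,
)
latest_scores = response.json()

prompt = f"Write a live commentary update based on these scores: {latest_scores}"
```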
When developing an LLM application, it’s crucial to ensure that the data used for training the model complies with licensing requirements to avoid legal risks.
Which action is NOT appropriate to avoid legal risks?
A. Reach out to the data curators directly before you have started using the trained model to let them know.
B. Use any available data you personally created which is completely original and you can decide what license to use.
C. Only use data explicitly labeled with an open license and ensure the license terms are followed.
D. Reach out to the data curators directly after you have started using the trained model to let them know.
D. Reaching out to data curators after you’ve already started using the trained model is not appropriate. Licensing should be verified before using the data to avoid potential legal issues.
- Why D is correct: Using data without confirming its license beforehand puts you at risk of violating licensing terms. Contacting curators after usage doesn’t mitigate the initial violation.
- Why A is incorrect: Contacting data curators before using the model is a proactive and appropriate step to ensure compliance.
- Why B is incorrect: Using your own, original data is acceptable as you control the licensing.
- Why C is incorrect: Using openly licensed data, provided you adhere to the license terms, is a valid and appropriate practice.
A Generative AI Engineer is developing a chatbot designed to assist users with insurance-related queries. The chatbot is built on a large language model (LLM) and is conversational. However, to maintain the chatbot’s focus and to comply with company policy, it must not provide responses to questions about politics. Instead, when presented with political inquiries, the chatbot should respond with a standard message:
“Sorry, I cannot answer that. I am a chatbot that can only answer questions around insurance.”
Which framework type should be implemented to solve this?
A. Safety Guardrail
B. Security Guardrail
C. Contextual Guardrail
D. Compliance Guardrail
The correct answer is A. Safety Guardrail.
- Why A is correct: Safety Guardrails are designed to ensure that a conversational AI system stays within intended boundaries, preventing it from generating unsafe or irrelevant responses, including explicitly disallowed topics like politics.
- Why B is incorrect: Security Guardrails focus on protecting the system from vulnerabilities and unauthorized access, not on content filtering.
- Why C is incorrect: Contextual Guardrails are more about keeping the conversation relevant to the current topic (insurance) but do not necessarily block specific topics entirely.
- Why D is incorrect: While Compliance Guardrails might seem relevant, they are more broadly concerned with adhering to legal and regulatory requirements, rather than specific content restrictions dictated by company policy.
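A minimal sketch of such a guardrail is shown below; the keyword-based is_political() check and the stubbed answer function are illustrative placeholders (a production guardrail would typically use a dedicated topic classifier or an LLM-based check).

```python
# Minimal sketch of a topic guardrail. is_political() and answer_insurance_question()
# are hypothetical placeholders for a real classifier and the chatbot's normal answer path.
REFUSAL = ("Sorry, I cannot answer that. I am a chatbot that can only answer "
           "questions around insurance.")

POLITICAL_TERMS = {"election", "senator", "political party", "vote", "president"}

def is_political(user_message: str) -> bool:
    text = user_message.lower()
    return any(term in text for term in POLITICAL_TERMS)

def answer_insurance_question(user_message: str) -> str:
    return f"[LLM-generated insurance answer to: {user_message}]"

def guarded_chatbot(user_message: str) -> str:
    if is_political(user_message):
        return REFUSAL
    return answer_insurance_question(user_message)

print(guarded_chatbot("Who should I vote for?"))
print(guarded_chatbot("Does my policy cover hail damage?"))
```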
A Generative AI Engineer is responsible for developing a chatbot to enable their company’s internal HelpDesk Call Center team to more quickly find related tickets and provide resolution. While creating the GenAI application work breakdown tasks for this project, they realize they need to start planning which data sources (either Unity Catalog volume or Delta table) they could choose for this application. They have collected several candidate data sources for consideration:
- call_rep_history: a Delta table with primary keys representative_id, call_id. This table is maintained to calculate representatives’ call resolution from fields call_duration and call_start_time.
- transcript Volume: a Unity Catalog Volume of all recordings as *.wav files, but also a text transcript as *.txt files.
- call_cust_history: a Delta table with primary keys customer_id, call_id. This table is maintained to calculate how much internal customers use the HelpDesk to make sure that the chargeback model is consistent with actual service use.
- call_detail: a Delta table that includes a snapshot of all call details updated hourly. It includes root_cause and resolution fields, but those fields may be empty for calls that are still active.
- maintenance_schedule: a Delta table that includes a listing of both HelpDesk application outages as well as planned upcoming maintenance downtimes.
They need sources that could add context to best identify ticket root cause and resolution.
Which TWO sources do that? (Choose two.)
A. call_cust_history
B. maintenance_schedule
C. call_rep_history
D. call_detail
E. transcript Volume
The correct answers are D. call_detail and E. transcript Volume.
- Why D is correct: The call_detail Delta table directly includes root_cause and resolution fields, providing immediate insights into ticket resolution, even if some entries are incomplete.
- Why E is correct: The transcript Volume contains text transcripts of conversations, offering detailed information about the customer’s issue, which is invaluable for determining the root cause.
Why other options are incorrect:
- A. call_cust_history: This table focuses on customer usage of the HelpDesk, which is relevant for chargeback models but not for identifying the root cause and resolution of specific tickets.
- B. maintenance_schedule: This table is useful for understanding outages and downtime, but it doesn’t provide specific context for identifying the root cause and resolution of individual tickets.
- C. call_rep_history: This table focuses on representative performance metrics, which is not directly related to identifying ticket root causes and resolutions.
A Generative AI Engineer is creating an LLM-based application. The documents for its retriever have been chunked to a maximum of 512 tokens each. The Generative AI Engineer knows that cost and latency are more important than quality for this application. They have several context length levels to choose from.
Which will fulfill their need?
A. context length 514; smallest model is 0.44GB and embedding dimension 768
B. context length 2048: smallest model is 11GB and embedding dimension 2560
C. context length 32768: smallest model is 14GB and embedding dimension 4096
D. context length 512: smallest model is 0.13GB and embedding dimension 384
The correct answer is D.
- Why D is correct: Because cost and latency are more important than quality, the smallest model with a context length that accommodates the 512 token chunks is the best choice. Option D (context length 512, 0.13GB model) fulfills this requirement with the least resources.
- Why other options are wrong:
- A: While this model is relatively small, it is still larger (0.44GB vs. 0.13GB) and has a higher embedding dimension (768 vs. 384) than option D, adding cost and latency without a quality gain that matters for this application.
- B & C: These options offer much larger context lengths and significantly larger model sizes. This leads to increased cost and latency, violating the stated priorities.
A Generative AI Engineer is designing a RAG application for answering user questions on technical regulations as they learn a new sport. What are the steps needed to build this RAG application and deploy it?
A. Ingest documents from a source –> Index the documents and save to Vector Search –> User submits queries against an LLM –> LLM retrieves relevant documents –> Evaluate model –> LLM generates a response –> Deploy it using Model Serving
B. Ingest documents from a source –> Index the documents and save to Vector Search –> User submits queries against an LLM –> LLM retrieves relevant documents –> LLM generates a response –> Evaluate model –> Deploy it using Model Serving
C. Ingest documents from a source –> Index the documents and save to Vector Search –> Evaluate model –> Deploy it using Model Serving
D. User submits queries against an LLM –> Ingest documents from a source –> Index the documents and save to Vector Search –> LLM retrieves relevant documents –> LLM generates a response –> Evaluate model –> Deploy it using Model Serving
B. The correct sequence of steps is: Ingest documents from a source –> Index the documents and save to Vector Search –> User submits queries against an LLM –> LLM retrieves relevant documents –> LLM generates a response –> Evaluate model –> Deploy it using Model Serving
Explanation:
- Why B is correct: This option outlines the logical flow of building and deploying a RAG application. First, documents are ingested and indexed to create a searchable vector database. Then, a user query initiates the retrieval process, the LLM generates a response, the model is evaluated, and finally, the application is deployed.
- Why A is incorrect: Option A places “Evaluate model” before “LLM generates a response”, which is out of order. The application can only be evaluated on the responses it generates, so evaluation must come after response generation.
- Why C is incorrect: Option C misses several crucial steps in the RAG pipeline, specifically the user query, document retrieval, and LLM response generation. It jumps directly from indexing to evaluation and deployment, which is not a complete RAG implementation.
- Why D is incorrect: Option D starts with the user submitting a query before the documents are ingested and indexed. This is impossible, as the LLM needs a knowledge base to retrieve relevant documents from. Document ingestion and indexing must occur before any queries can be processed.
A Generative AI Engineer just deployed an LLM application at a digital marketing company that assists with answering customer service inquiries.
Which metric should they monitor for their customer service LLM application in production?
A. Number of customer inquiries processed per unit of time
B. Energy usage per query
C. Final perplexity scores for the training of the model
D. HuggingFace Leaderboard values for the base LLM
The correct answer is A. Number of customer inquiries processed per unit of time.
- Why A is correct: This metric directly reflects the application’s performance in a production environment for customer service. It measures the efficiency and throughput of the LLM application in handling customer inquiries.
- Why other options are wrong:
- B. Energy usage per query: While energy efficiency is important, it’s not the primary metric for evaluating the customer service application’s performance in production.
- C. Final perplexity scores for the training of the model: Perplexity is a training metric and not relevant once the model is deployed.
- D. HuggingFace Leaderboard values for the base LLM: HuggingFace Leaderboard values are used during the development phase, not during production monitoring.
A Generative AI Engineer is building a Generative AI system that suggests the best matched employee team member to newly scoped projects. The team member is selected from a very large team. The match should be based upon project date availability and how well their employee profile matches the project scope. Both the employee profile and project scope are unstructured text.
How should the Generative AI Engineer architect their system?
A. Create a tool for finding available team members given project dates. Embed all project scopes into a vector store, perform a retrieval using team member profiles to find the best team member.
B. Create a tool for finding team member availability given project dates, and another tool that uses an LLM to extract keywords from project scopes. Iterate through available team members’ profiles and perform keyword matching to find the best available team member.
C. Create a tool to find available team members given project dates. Create a second tool that can calculate a similarity score for a combination of team member profile and the project scope. Iterate through the team members and rank by best score to select a team member.
D. Create a tool for finding available team members given project dates. Embed team profiles into a vector store and use the project scope and filtering to perform retrieval to find the available best matched team members.
**
**
D is the correct answer.
- Why D is correct: Option D is the most scalable and efficient approach. By embedding team profiles into a vector store, the system can quickly retrieve the best-matched team members for a given project scope using vector similarity search. This approach is particularly well-suited for very large teams because it avoids iterating through all team member profiles.
- Why A is incorrect: Embedding project scopes and retrieving with team member profiles inverts the lookup: matching a new project would require querying with every team member’s profile, which scales poorly for a very large team and does not combine availability filtering with the match.
- Why B is incorrect: Keyword matching is less effective than embedding similarity and not as useful for unstructured text. Additionally, iterating through team members does not scale to “very large teams”.
- Why C is incorrect: Calculating similarity scores by iterating through all team members is inefficient for a large team. Furthermore, the “similarity score” is vague and does not take advantage of an established and effective method like vector embeddings and similarity search.
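A minimal sketch of the architecture in option D is shown below, using an in-memory stand-in for the vector store and hypothetical profiles and availability data; a real system would use a managed vector index and a proper embedding model.

```python
# Sketch of option D: filter by availability, then rank by embedding similarity.
# embed() is a toy stand-in for a real embedding model; profiles and availability are hypothetical.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' for illustration only."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

team_profiles = {
    "alice": "senior data engineer spark delta pipelines",
    "bob": "frontend developer react typescript design systems",
}
available = {"alice", "bob"}  # pretend output of the date-availability tool

project_scope = "build streaming data pipelines on spark"
scope_vec = embed(project_scope)

ranked = sorted(
    ((cosine(scope_vec, embed(profile)), name)
     for name, profile in team_profiles.items() if name in available),
    reverse=True,
)
print(ranked[0][1])  # best available match for the project scope
```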
A Generative AI Engineer has a provisioned throughput model serving endpoint as part of a RAG application and would like to monitor the serving endpoint’s incoming requests and outgoing responses. The current approach is to include a micro-service in between the endpoint and the user interface to write logs to a remote server.
Which Databricks feature should they use instead which will perform the same task?
A. Vector Search
B. Lakeview
C. DBSQL
D. Inference Tables
D. Inference Tables
- Why D is correct: Inference Tables automatically capture a serving endpoint’s incoming requests and outgoing responses and log them to a Delta table, which is exactly the monitoring task the micro-service currently performs (see the sketch below).
- Why other options are incorrect:
- A. Vector Search is used for similarity search of embeddings, not for logging requests and responses.
- B. Lakeview is for creating data-driven dashboards for data analysis and not for this task.
- C. DBSQL is used for querying data, not for logging requests and responses from a serving endpoint.
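For illustration, a hedged sketch of monitoring from a Databricks notebook once inference tables are enabled; the table name is hypothetical, and the column names (timestamp_ms, request, response) assume the standard inference table schema.

```python
# Hedged sketch (Databricks notebook): inspect and aggregate an inference table.
# Table name is hypothetical; columns assume the standard inference table schema.
from pyspark.sql import functions as F

logs = spark.table("ml.rag_app.chat_endpoint_payload")  # hypothetical inference table

# Requests handled per hour (throughput of the endpoint).
hourly = (
    logs
    .withColumn("ts", F.from_unixtime(F.col("timestamp_ms") / 1000).cast("timestamp"))
    .groupBy(F.date_trunc("hour", "ts").alias("hour"))
    .count()
    .orderBy("hour")
)
display(hourly)

# Inspect raw request/response payloads for a sample of calls.
display(logs.select("timestamp_ms", "request", "response").limit(10))
```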
A Generative AI Engineer is building a system which will answer questions on the latest stock news articles. Which will NOT help with ensuring the outputs are relevant to financial news?
A. Implement a comprehensive guardrail framework that includes policies for content filters tailored to the finance sector.
B. Increase the compute to improve processing speed of questions to allow greater relevancy analysis
C. Implement a profanity filter to screen out offensive language.
D. Incorporate manual reviews to correct any problematic outputs prior to sending to the users
B. Increasing compute power primarily improves processing speed but does not inherently improve the relevancy of the answers to financial news. Relevancy is determined by the data sources, retrieval methods, and filtering mechanisms, not processing speed. Options A, C, and D directly contribute to ensuring relevance: A uses tailored content filters, C filters offensive language to keep the responses professional, and D uses manual reviews to correct any irrelevant outputs.
A Generative AI Engineer has been asked to build an LLM-based question-answering application. The application should take into account new documents that are frequently published. The engineer wants to build this application with the least cost and least development effort and have it operate at the lowest cost possible.
Which combination of chaining components and configuration meets these requirements?
A. For the application a prompt, a retriever, and an LLM are required. The retriever output is inserted into the prompt which is given to the LLM to generate answers.
B. The LLM needs to be frequently retrained with the new documents in order to provide the most up-to-date answers.
C. For the question-answering application, prompt engineering and an LLM are required to generate answers.
D. For the application a prompt, an agent and a fine-tuned LLM are required. The agent is used by the LLM to retrieve relevant content that is inserted into the prompt which is given to the LLM to generate answers.
A. This option is the most suitable because it uses a retriever to fetch information from new documents and insert it into the prompt for the LLM. This approach effectively provides up-to-date information, reflecting frequently updated documentation, while minimizing cost and development effort.
- Why A is correct: It provides a cost-effective and efficient way to incorporate new documents into the question-answering application by using a retriever to find relevant information and inserting it into the prompt.
- Why B is wrong: Frequently retraining the LLM on new documents is costly and slow, which conflicts with the least-cost requirement, and the option does not describe the application architecture.
- Why C is wrong: It mentions prompt engineering and LLM but does not explain how to handle updates for new documents.
- Why D is wrong: It describes an agent-based approach with a fine-tuned LLM, which adds cost and development effort beyond the simple retriever-plus-prompt chain in A without a clear benefit for this use case.
A Generative AI Engineer is using the code below to test setting up a vector store:
Assuming they intend to use Databricks managed embeddings with the default embedding model, what should be the next logical function call?
A. vsc.get_index()
B. vsc.create_delta_sync_index()
C. vsc.create_direct_access_index()
D. vsc.similarity_search()
The correct answer is C. vsc.create_direct_access_index().
Explanation: create_direct_access_index() is the appropriate next step when testing the setup of a vector store, especially when using Databricks managed embeddings with the default embedding model without a pre-existing Delta table. This method allows for manually adding documents and embeddings, which is ideal for initial testing and minimal setup.
Why other options are wrong:
- A. vsc.get_index(): This function is used to retrieve an existing index, not create a new one. Since the engineer is setting up the vector store, an index likely doesn’t exist yet.
- B. vsc.create_delta_sync_index(): This option is suitable for production workflows where a Delta table is already in use and the index needs to automatically synchronize with it. It’s not appropriate for a minimal test setup.
- D. vsc.similarity_search(): This function is used to perform a similarity search on an existing index. An index must be created and populated first.
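For reference, a hedged sketch of the call shape described above using the databricks-vectorsearch client; the endpoint name, index name, schema, and embedding dimension are placeholder values, and documents with precomputed embedding vectors are then added manually.

```python
# Hedged sketch of creating a direct access index and manually upserting documents.
# All names and values are placeholders; check the databricks-vectorsearch docs for your version.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_direct_access_index(
    endpoint_name="vector_search_demo_endpoint",       # placeholder endpoint
    index_name="main.default.test_direct_index",       # placeholder UC index name
    primary_key="id",
    embedding_dimension=1024,                           # depends on the embedding model used
    embedding_vector_column="text_vector",
    schema={"id": "int", "text": "string", "text_vector": "array<float>"},
)

# Direct access indexes take manually supplied embeddings:
index.upsert([
    {"id": 1, "text": "hello world", "text_vector": [0.1] * 1024},
])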
A Generative AI Engineer wants to build an LLM-based solution to help a restaurant improve its online customer experience with bookings by automatically handling common customer inquiries. The goal of the solution is to minimize escalations to human intervention and phone calls while maintaining a personalized interaction. To design the solution, the Generative AI Engineer needs to define the input data to the LLM and the task it should perform.
Which input/output pair will support their goal?
A. Input: Online chat logs; Output: Group the chat logs by users, followed by summarizing each user’s interactions
B. Input: Online chat logs; Output: Buttons that represent choices for booking details
C. Input: Customer reviews; Output: Classify review sentiment
D. Input: Online chat logs; Output: Cancellation options
B. Input: Online chat logs; Output: Buttons that represent choices for booking details
This is the best answer because it allows for the automatic handling of customer inquiries in a structured way, providing immediate responses to questions about reservations.
* Why A is wrong: Summarizing user interactions, while potentially useful, doesn’t directly address the task of handling customer inquiries and minimizing human intervention during the booking process.
* Why C is wrong: Classifying review sentiment is more focused on understanding customer feedback than directly assisting with booking inquiries.
* Why D is wrong: Providing cancellation options is a limited functionality and doesn’t address the broader goal of handling various customer inquiries related to bookings.
What is an effective method to preprocess prompts using custom code before sending them to an LLM?
A. Directly modify the LLM’s internal architecture to include preprocessing steps
B. It is better not to introduce custom code to preprocess prompts as the LLM has not been trained with examples of the preprocessed prompts
C. Rather than preprocessing prompts, it’s more effective to postprocess the LLM outputs to align the outputs to desired outcomes
D. Write a MLflow PyFunc model that has a separate function to process the prompts
D. Writing an MLflow PyFunc model with a separate function to process prompts allows for systematic and flexible preprocessing, potentially improving LLM performance through optimized prompts.
- Why D is correct: This approach enables organized and adaptable prompt manipulation before they are fed into the LLM.
- Why A is wrong: Modifying the LLM’s internal architecture is generally not feasible or practical.
- Why B is wrong: Preprocessing prompts with custom code can be beneficial for optimizing LLM performance.
- Why C is wrong: While post-processing is valuable, pre-processing can proactively shape the input for better initial results.
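A minimal sketch of option D is shown below, assuming mlflow and pandas are installed; the preprocessing rule and the stubbed LLM call are illustrative placeholders for whatever custom logic and downstream model the application actually uses.

```python
# Sketch of option D: an MLflow PyFunc model with a dedicated prompt-preprocessing step.
# The preprocessing rule and the stubbed LLM call are illustrative only.
import mlflow
import pandas as pd

class PromptPreprocessingModel(mlflow.pyfunc.PythonModel):
    def _preprocess(self, prompt: str) -> str:
        # Custom preprocessing: trim whitespace and prepend a fixed instruction.
        return "Answer concisely.\n" + prompt.strip()

    def predict(self, context, model_input: pd.DataFrame) -> pd.Series:
        prompts = model_input["prompt"].map(self._preprocess)
        # Placeholder for the real LLM call on each preprocessed prompt.
        return prompts.map(lambda p: f"[LLM output for: {p!r}]")

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="prompt_model",
        python_model=PromptPreprocessingModel(),
    )
```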
A Generative AI Engineer is developing an LLM application that users can use to generate personalized birthday poems based on their names.
Which technique would be most effective in safeguarding the application, given the potential for malicious user inputs?
A. Implement a safety filter that detects any harmful inputs and ask the LLM to respond that it is unable to assist
B. Reduce the time that the users can interact with the LLM
C. Ask the LLM to remind the user that the input is malicious but continue the conversation with the user
D. Increase the amount of compute that powers the LLM to process input faster
A. Implementing a safety filter is the most effective technique. It directly addresses the potential for malicious input by detecting and blocking harmful content, thus safeguarding the application.
- Why A is correct: Safety filters prevent the LLM from processing and potentially generating harmful or inappropriate responses based on malicious input.
- Why B is incorrect: Reducing interaction time doesn’t prevent malicious input, it only limits the duration of potential harm.
- Why C is incorrect: Reminding the user that the input is malicious while continuing the conversation does not prevent the generation of harmful outputs and could potentially encourage further malicious behavior.
- Why D is incorrect: Increasing compute power doesn’t address the safety concerns related to malicious input. It only affects processing speed.
Which indicator should be considered to evaluate the safety of the LLM outputs when qualitatively assessing LLM responses for a translation use case?
A. The ability to generate responses in code
B. The similarity to the previous language
C. The latency of the response and the length of text generated
D. The accuracy and relevance of the responses
D. Accuracy and relevance are key to ensuring the LLM output is safe and appropriate in a translation use case.
- Why D is correct: Accuracy and relevance directly relate to whether the translated output conveys the intended meaning without introducing harmful or misleading information.
- Why the other options are incorrect:
- A: Code generation is irrelevant to translation safety.
- B: Similarity to the original language doesn’t guarantee safety; a harmful statement could be translated faithfully.
- C: Latency and text length are performance metrics, not safety indicators.
A Generative AI Engineer is developing a patient-facing healthcare-focused chatbot. If the patient’s question is not a medical emergency, the chatbot should solicit more information from the patient to pass to the doctor’s office and suggest a few relevant pre-approved medical articles for reading. If the patient’s question is urgent, direct the patient to calling their local emergency services.
Given the following user input:
“I have been experiencing severe headaches and dizziness for the past two days.”
Which response is most appropriate for the chatbot to generate?
A. Here are a few relevant articles for your browsing. Let me know if you have questions after reading them.
B. Please call your local emergency services.
C. Headaches can be tough. Hope you feel better soon!
D. Please provide your age, recent activities, and any other symptoms you have noticed along with your headaches and dizziness.
B. Please call your local emergency services.
- Why this is right: Severe headaches and dizziness persisting for two days can indicate a serious medical condition requiring immediate attention. The chatbot should prioritize patient safety and direct the user to emergency services.
- Why other options are wrong:
- A: Providing articles is inappropriate for potentially urgent symptoms.
- C: This is a dismissive and unhelpful response.
- D: Delaying immediate help by asking for more information is dangerous when the symptoms suggest a possible emergency.
After changing the response generating LLM in a RAG pipeline from GPT-4 to a model with a shorter context length that the company self-hosts, the Generative AI Engineer is getting the following error:
Image Text: ValueError: This model's maximum context length is 2048 tokens. However, you requested 2049 tokens (1025 in the messages, 1024 in the completion). Please reduce the length of the messages or completion.
What TWO solutions should the Generative AI Engineer implement without changing the response generating model? (Choose two.)
A. Use a smaller embedding model to generate embeddings
B. Reduce the maximum output tokens of the new model
C. Decrease the chunk size of embedded documents
D. Reduce the number of records retrieved from the vector database
E. Retrain the response generating model using ALiBi
The correct answers are C and D.
C is correct because decreasing the chunk size of embedded documents directly reduces the number of tokens included in the prompt, addressing the context length issue.
D is correct because reducing the number of records retrieved from the vector database limits the amount of information passed to the LLM, thereby reducing the total number of tokens in the prompt.
A is incorrect because using a smaller embedding model affects the quality of embeddings, but it doesn’t directly address the token limit issue in the prompt.
B is incorrect because reducing the maximum output tokens limits the model’s ability to generate complete responses, which is undesirable.
E is incorrect because retraining the response generating model using ALiBi is complex and unnecessary. The issue can be resolved by reducing the input tokens without retraining.
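A small back-of-the-envelope token budget shows why C and D resolve the error; the model limit and completion length come from the error message, while the query/template overhead and chunk configurations are illustrative.

```python
# Back-of-the-envelope token budget for the 2048-token model in the error message.
MAX_CONTEXT = 2048          # model limit from the error
COMPLETION_TOKENS = 1024    # requested completion length (from the error)
PROMPT_BUDGET = MAX_CONTEXT - COMPLETION_TOKENS   # 1024 tokens left for the prompt

QUERY_AND_TEMPLATE_TOKENS = 125   # illustrative overhead for the question + instructions

def prompt_tokens(chunk_size: int, num_chunks: int) -> int:
    return QUERY_AND_TEMPLATE_TOKENS + chunk_size * num_chunks

# Current configuration overflows the budget...
print(prompt_tokens(chunk_size=512, num_chunks=2), ">", PROMPT_BUDGET)
# ...but smaller chunks (C) or fewer retrieved records (D) bring it back under.
print(prompt_tokens(chunk_size=256, num_chunks=2), "<=", PROMPT_BUDGET)
print(prompt_tokens(chunk_size=512, num_chunks=1), "<=", PROMPT_BUDGET)
```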
A Generative AI Engineer is building a RAG application that answers questions about internal documents for the company SnoPen AI. The source documents may contain a significant amount of irrelevant content, such as advertisements, sports news, or entertainment news, or content about other companies. Which approach is advisable when building a RAG application to achieve this goal of filtering irrelevant information?
A. Keep all articles because the RAG application needs to understand non-company content to avoid answering questions about them.
B. Include in the system prompt that any information it sees will be about SnoPenAI, even if no data filtering is performed.
C. Include in the system prompt that the application is not supposed to answer any questions unrelated to SnoPen AI.
D. Consolidate all SnoPen AI related documents into a single chunk in the vector database.
The correct answer is C.
- Why C is correct: By specifying in the system prompt that the application should not answer questions unrelated to SnoPen AI, you directly instruct the model to filter out irrelevant information. This allows the application to focus solely on questions relevant to the company, improving accuracy and efficiency.
- Why other options are wrong:
- A: Keeping all articles, including irrelevant content, would dilute the context and potentially lead the model to answer questions outside the scope of SnoPen AI, defeating the purpose of filtering.
- B: Including in the prompt that all information is about SnoPen AI, without any data filtering, would be misleading to the model, especially if the ingested documents contain information about other topics. The model would likely generate inaccurate responses.
- D: Consolidating all SnoPen AI related documents into a single chunk might seem beneficial, but it could create an overly large chunk that exceeds the model’s context window, leading to truncation or incomplete information retrieval. Also, it doesn’t inherently filter out irrelevant information within those documents.
A Generative AI Engineer has successfully ingested unstructured documents and chunked them by document sections. They would like to store the chunks in a Vector Search index. The current format of the dataframe has two columns: (i) original document file name (ii) an array of text chunks for each document.
What is the most performant way to store this dataframe?
A. Split the data into train and test set, create a unique identifier for each document, then save to a Delta table
B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a Delta table
C. First create a unique identifier for each document, then save to a Delta table
D. Store each chunk as an independent JSON file in Unity Catalog Volume. For each JSON file, the key is the document section name and the value is the array of text chunks for that section
B. Flatten the dataframe to one chunk per row, create a unique identifier for each row, and save to a Delta table
This is the most performant because flattening the dataframe ensures each chunk is a distinct row, optimized for vector search indexing. The unique identifier enables efficient retrieval. Options A and C do not address the need to index individual chunks. Option D is less performant due to the overhead of managing numerous JSON files.
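A hedged PySpark sketch of option B is shown below; the column and table names are hypothetical, and it assumes a Databricks notebook where spark is available.

```python
# Hedged sketch of option B: one chunk per row, a unique id per row, saved to Delta.
# Column and table names are hypothetical.
from pyspark.sql import functions as F

chunked_df = spark.table("ml.rag_app.chunked_documents")  # columns: file_name, chunks (array<string>)

flattened = (
    chunked_df
    .select("file_name", F.posexplode("chunks").alias("chunk_position", "chunk_text"))
    .withColumn(
        "chunk_id",
        F.concat_ws("::", "file_name", F.col("chunk_position").cast("string")),
    )
)

(flattened
 .write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("ml.rag_app.document_chunks"))  # ready to index with Vector Search
```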