This blog is co-written with Josh Reini, Shayak Sen and Anupam Datta from TruEra.
Amazon SageMaker JumpStart provides a variety of pretrained foundation models such as Llama-2 and Mistal 7B that can be quickly deployed to an endpoint. These foundation models perform well with generative tasks, from crafting text and summaries, answering questions, to producing images and videos. Despite the great generalization capabilities of these models, there are often use cases where these models have to be adapted to new tasks or domains. One way to surface this need is by evaluating the model against a curated ground truth dataset. After the need to adapt the foundation model is clear, you can use a set of techniques to carry that out. A popular approach is to fine-tune the model using a dataset that is tailored to the use case. Fine-tuning can improve the foundation model and its efficacy can again be measured against the ground truth dataset. This notebook shows how to fine-tune models with SageMaker JumpStart.
One challenge with this approach is that curated ground truth datasets are expensive to create. In this post, we address this challenge by augmenting this workflow with a framework for extensible, automated evaluations. We start off with a baseline foundation model from SageMaker JumpStart and evaluate it with TruLens, an open source library for evaluating and tracking large language model (LLM) apps. After we identify the need for adaptation, we can use fine-tuning in SageMaker JumpStart and confirm improvement with TruLens. TruLens evaluations use an abstraction of feedback functions. These functions can be implemented in several ways, including BERT-style models, appropriately prompted LLMs, and more. TruLens’ integration with Amazon Bedrock allows you to run evaluations using LLMs available from Amazon Bedrock. The reliability of the Amazon Bedrock infrastructure is particularly valuable for use in performing evaluations across development and production. This post serves as both an introduction to TruEra’s place in the modern LLM app stack and a hands-on guide to using Amazon SageMaker and TruEra to deploy, fine-tune, and iterate on LLM apps.
Here is the complete notebook with code samples to show performance evaluation using TruLens.
TruEra in the LLM app stack
TruEra lives at the observability layer of LLM apps. Although new components have worked their way into the compute layer (fine-tuning, prompt engineering, model APIs) and storage layer (vector databases), the need for observability remains. This need spans from development to production and requires interconnected capabilities for testing, debugging, and production monitoring, as illustrated in the following figure.
In development, you can use open source TruLens to quickly evaluate, debug, and iterate on your LLM apps in your environment. A comprehensive suite of evaluation metrics, including both LLM-based and traditional metrics available in TruLens, allows you to measure your app against criteria required for moving your application to production. In production, these logs and evaluation metrics can be processed at scale with TruEra production monitoring. By connecting production monitoring with testing and debugging, dips in performance such as hallucination, safety, security, and more can be identified and corrected.
Deploy foundation models in SageMaker
You can deploy foundation models such as Llama-2 in SageMaker with just two lines of Python code:
“`html
from sagemaker.jumpstart.model import JumpStartModel
pretrained_model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b")
pretrained_predictor = pretrained_model.deploy()
“`
Invoke the model endpoint
After deployment, you can invoke the deployed model endpoint by first creating a payload containing your inputs and model parameters:
“`html
payload = {
"inputs": "I believe the meaning of life is",
"parameters": {
"max_new_tokens": 64,
"top_p": 0.9,
"temperature": 0.6,
"return_full_text": False,
},
}
“`
Then you can simply pass this payload to the endpoint’s predict method. Note that you must pass the attribute to accept the end-user license agreement each time you invoke the model:
“`html
response = pretrained_predictor.predict(payload, custom_attributes="accept_eula=true")
“`
Evaluate performance with TruLens
Now you can use TruLens to set up your evaluation. TruLens is an observability tool, offering an extensible set of feedback functions to track and evaluate LLM-powered apps. Feedback functions are essential here in verifying the absence of hallucination in the app. These feedback functions are implemented by using off-the-shelf models from providers such as Amazon Bedrock. Amazon Bedrock models are an advantage here because of their verified quality and reliability. You can set up the provider with TruLens via the following code:
“`html
from trulens_eval import Bedrock
# Initialize AWS Bedrock feedback function collection class:
provider = Bedrock(model_id = "amazon.titan-tg1-large", region_name="us-east-1")
“`
In this example, we use three feedback functions: answer relevance, context relevance, and groundedness. These evaluations have quickly become the standard for hallucination detection in context-enabled question answering applications and are especially useful for unsupervised applications, which cover the vast majority of today’s LLM applications. Let’s go through each of these feedback functions to understand how they can benefit us.
Context relevance
Context is a critical input to the quality of our application’s responses, and it can be useful to programmatically ensure that the context provided is relevant to the input query. This is critical because this context will be used by the LLM to form an answer, so any irrelevant information in the context could be weaved into a hallucination. TruLens enables you to evaluate context relevance by using the structure of the serialized record:
“`html
f_context_relevance = (Feedback(provider.relevance, name = "Context Relevance")
.on(Select.Record.calls[0].args.args[0])
.on(Select.Record.calls[0].args.args[1])
)
“`
Because the context provided to LLMs is the most consequential step of a Retrieval Augmented Generation (RAG) pipeline, context relevance is critical for understanding the quality of retrievals. Working with customers across sectors, we’ve seen a variety of failure modes identified using this evaluation, such as incomplete context, extraneous irrelevant context, or even lack of sufficient context available. By identifying the nature of these failure modes, our users are able to adapt their indexing (such as embedding model and chunking) and retrieval strategies (such as sentence windowing and automerging) to mitigate these issues.
Groundedness
After the context is retrieved, it is then formed into an answer by an LLM. LLMs are often prone to stray from the facts provided, exaggerating or expanding to a correct-sounding answer. To verify the groundedness of the application, you should separate the response into separate statements and independently search for evidence that supports each within the retrieved context. grounded = Groundedness(groundedness_provider=provider) f_groundedness = (Feedback(grounded.groundedness_measure, name = “Groundedness”) .on(Select.Record.calls[0].args.args[1]) .on_output() .aggregate(grounded.grounded_statements_aggregator) )
Issues with groundedness can often be a downstream effect of context relevance. When the LLM lacks sufficient context to form an evidence-based response, it is more likely to hallucinate in its attempt to generate a plausible response. Even in cases where complete and relevant context is provided, the LLM can fall into issues with groundedness. Particularly, this has played out in applications where the LLM responds in a particular style or is being used to complete a task it is not well suited for. Groundedness evaluations allow TruLens users to break down LLM responses claim by claim to understand where the LLM is most often hallucinating. Doing so has shown to be particularly useful for illuminating the way forward in eliminating hallucination through model-side changes (such as prompting, model choice, and model parameters).
Answer relevance
Lastly, the response still needs to helpfully answer the original question. You can verify this by evaluating the relevance of the final response to the user input:
“`html
f_answer_relevance = (Feedback(provider.relevance, name = "Answer Relevance")
.on(Select.Record.calls[0].args.args[0])
.on_output()
)
“`
By reaching satisfactory evaluations for this triad, you can make a nuanced statement about your application’s correctness; this application is verified to be hallucination free up to the limit of its knowledge base. In other words, if the vector database contains only accurate information, then the answers provided by the context-enabled question answering app are also accurate.
Ground truth evaluation
In addition to these feedback functions for detecting hallucination, we have a test dataset, DataBricks-Dolly-15k, that enables us to add ground truth similarity as a fourth evaluation metric. See the following code:
“`html
from datasets import load_dataset
dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# To train for question answering/information extraction, you can replace the assertion in next line to example["category"] == "closed_qa"/"information_extraction".
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summar
Source link