Exciting news! The Jina Embeddings v2 model, developed by Jina AI, is now available to customers through Amazon SageMaker JumpStart, where it can be deployed for model inference with a single click. The model supports a context length of 8,192 tokens. You can deploy it quickly using SageMaker JumpStart, a machine learning (ML) hub that offers foundation models, built-in algorithms, and pre-built ML solutions deployable with minimal effort.
Text embedding involves converting text into numerical representations within a high-dimensional vector space. Text embeddings have a wide range of applications in enterprise artificial intelligence (AI), including multimodal search for ecommerce, content personalization, recommender systems, and data analytics.
Jina Embeddings v2 is a collection of text embedding models, developed by Jina AI in Berlin, known for strong performance on a range of public benchmarks.
In this article, we will guide you through discovering and deploying the jina-embeddings-v2 model as part of a Retrieval Augmented Generation (RAG)-based question answering system in SageMaker JumpStart. This tutorial can serve as a starting point for building chatbot-based solutions for customer service, internal support, and question answering systems utilizing internal and private documents.
Understanding RAG
RAG is the process of enhancing the output of a large language model (LLM) by referencing a credible knowledge base outside of its training data sources before generating a response.
LLMs are trained on vast amounts of data and utilize billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the capabilities of LLMs to specific domains or an organization’s internal knowledge base without the need for retraining the model. It provides a cost-effective way to enhance LLM output to remain relevant, accurate, and useful in various contexts.
Benefits of Jina Embeddings v2 for RAG Applications
A RAG system uses a vector database as a knowledge retriever. It extracts a query from a user’s prompt and sends it to a vector database to reliably retrieve semantically relevant information. The diagram below illustrates the architecture of a RAG application with Jina AI and Amazon SageMaker.
Experienced ML practitioners favor Jina Embeddings v2 for several reasons:
- State-of-the-art performance on various text embedding benchmarks
- Long input-context length of 8,192 tokens
- Bilingual support through models trained on specific language pairs
- Cost-effectiveness in operating with small models and compact embedding vectors
Introduction to SageMaker JumpStart
SageMaker JumpStart offers ML practitioners a range of top-performing foundation models. Developers can deploy these models to dedicated SageMaker instances within a network-isolated environment and customize them using SageMaker for training and deployment.
You can now easily discover and deploy a Jina Embeddings v2 model with Amazon SageMaker Studio or programmatically through the SageMaker Python SDK. This allows you to leverage model performance and MLOps controls with SageMaker features like Amazon SageMaker Pipelines and Amazon SageMaker Debugger. With SageMaker JumpStart, the model is deployed in a secure AWS environment under your VPC controls for enhanced data security.
Jina Embeddings models are available in AWS Marketplace for seamless integration into your deployments when working in SageMaker.
AWS Marketplace enables you to find third-party software, data, and services that run on AWS and manage them from a centralized location.
AWS Marketplace offers a wide array of software listings with flexible pricing options and deployment methods to simplify software licensing and procurement processes.
Overview of the Solution
A notebook is available that creates and runs a RAG question answering system using Jina Embeddings and the Mistral 7B-Instruct LLM in SageMaker JumpStart.
This post provides an outline of the key steps required to bring a RAG application to life using generative AI models on SageMaker JumpStart. While some code and installation steps are omitted for readability, you can access the full Python notebook for execution.
Connecting to a Jina Embeddings v2 endpoint
To begin working with Jina Embeddings v2 models:
1. In SageMaker Studio, navigate to JumpStart.
2. Search for “jina” to find Jina AI’s provider page and available models.
3. Select Jina Embeddings v2 Base – en for English-language embeddings.
4. Choose Deploy.
5. In the dialog that opens, subscribe to the model on AWS Marketplace.
6. Return to SageMaker Studio and choose Deploy.
7. Select an instance type and enter a name for the endpoint.
8. Choose Deploy.
Once the endpoint is created, you can connect to it using the provided code snippet:
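The snippet itself isn’t reproduced in this post, but the following is a minimal sketch of connecting to the endpoint with boto3. The endpoint name, the request payload, and the response schema are assumptions to verify against your own deployment and the model’s documentation.

```python
import json
import boto3

# Hypothetical endpoint name chosen at deployment time
ENDPOINT_NAME = "jina-embeddings-v2-base-en"

sagemaker_runtime = boto3.client("sagemaker-runtime")

def embed(texts):
    """Return one embedding vector per input string."""
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"data": [{"text": t} for t in texts]}),
    )
    # Assumed response schema: {"data": [{"embedding": [...]}, ...]};
    # check the model's documentation for the exact format.
    output = json.loads(response["Body"].read())
    return [item["embedding"] for item in output["data"]]
```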
Preparing a dataset for indexing
This post uses a public dataset from Kaggle (CC0: Public Domain) containing audio transcripts from the Kurzgesagt – In a Nutshell YouTube channel.
Each row in the dataset includes the video title, URL, and transcript text.
Utilize the provided code to chunk the transcripts before indexing to focus on relevant content for answering user queries:
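The chunking code is omitted here for brevity; the following is a minimal sketch of a word-count chunker consistent with the max_words parameter discussed next. The function name chunk_text and the default of 128 words are illustrative choices, not taken from the original notebook.

```python
def chunk_text(text, max_words=128):
    """Split a transcript into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```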
The max_words parameter sets the maximum number of words per indexed chunk. More sophisticated chunking strategies exist, but for simplicity this post uses a plain word-count limit.
Index text embeddings for vector search
After you chunk the transcript text, you obtain embeddings for each chunk and link each chunk back to the original transcript and video title:
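A sketch of this step follows, reusing the hypothetical embed and chunk_text helpers from the earlier snippets. The CSV file name and column names are assumptions; adjust them to match the Kaggle dataset as downloaded.

```python
import pandas as pd

# Assumed local file and column names for the Kaggle dataset
videos = pd.read_csv("kurzgesagt_transcripts.csv")  # columns: Title, Url, Text

rows = []
for _, video in videos.iterrows():
    for chunk in chunk_text(video["Text"], max_words=128):
        rows.append({"title": video["Title"], "url": video["Url"], "chunk": chunk})

df = pd.DataFrame(rows)

# Embed in small batches to stay within the endpoint's payload limits
batch_size = 16
embeddings = []
for i in range(0, len(df), batch_size):
    embeddings.extend(embed(df["chunk"].iloc[i : i + batch_size].tolist()))
df["embeddings"] = embeddings
```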
The dataframe df now contains a column named embeddings that can be loaded into the vector database of your choice. Embeddings can then be retrieved from the vector database with a function such as find_most_similar_transcript_segment(query, n), which retrieves the n chunks closest to a user’s input query.
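For illustration, here is a minimal in-memory stand-in for that retrieval function, using cosine similarity over the embeddings column instead of a real vector database. The implementation is an assumption, not the notebook’s code.

```python
import numpy as np

def find_most_similar_transcript_segment(query, n=3):
    """Return the n chunks whose embeddings are most similar to the query."""
    query_vec = np.array(embed([query])[0])
    matrix = np.array(df["embeddings"].tolist())
    # Cosine similarity between the query and every indexed chunk
    scores = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(scores)[::-1][:n]
    return df.iloc[top][["title", "chunk"]]
```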
Prompt a generative LLM endpoint
For question answering based on an LLM, you can use the Mistral 7B-Instruct model on SageMaker JumpStart:
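A sketch of deploying the model through the SageMaker Python SDK follows; the JumpStart model ID shown is an assumption and should be verified against the current JumpStart catalog.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Assumed JumpStart model ID for Mistral 7B-Instruct; verify in the catalog
llm = JumpStartModel(model_id="huggingface-llm-mistral-7b-instruct")
llm_predictor = llm.deploy()
```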
Query the LLM
Now, for a query sent by a user, you first find the n semantically closest transcript chunks across all Kurzgesagt videos (using the vector distance between the chunk embeddings and the query embedding), and then provide those chunks as context for the LLM to answer the user’s query:
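The following sketch ties the pieces together. The example question is illustrative (chosen to match the sample answer below), and the request/response format assumes the model is served by a text-generation container that accepts an inputs/parameters JSON payload.

```python
# Illustrative question, chosen to match the sample answer below
question = "Can individuals solve climate change through their personal actions?"

# Retrieve the closest chunks and format them as context
context = "\n\n".join(
    f"Video: {row.title}\n{row.chunk}"
    for row in find_most_similar_transcript_segment(question, n=3).itertuples()
)

# Mistral-style instruction prompt wrapping context and question
prompt = (
    "<s>[INST] Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question} [/INST]"
)

response = llm_predictor.predict(
    {"inputs": prompt, "parameters": {"max_new_tokens": 512}}
)
print(response[0]["generated_text"])
```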
Based on the preceding question, the LLM might respond with an answer such as the following:
Based on the provided context, it does not seem that individuals can solve climate change solely through their personal actions. While personal actions such as using renewable energy sources and reducing consumption can contribute to mitigating climate change, the context suggests that larger systemic changes are necessary to address the issue fully.
Clean up
After you’re done running the notebook, make sure to delete all the resources that you created in the process so your billing is stopped. Use the following code:
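The exact cleanup code depends on how each endpoint was created; a sketch follows, assuming the LLM was deployed through the SDK (so a predictor object is available) and that the embeddings endpoint and its endpoint config share the name used earlier.

```python
# Delete the LLM endpoint deployed through the SageMaker SDK
llm_predictor.delete_model()
llm_predictor.delete_endpoint()

# Delete the embeddings endpoint created from the JumpStart UI;
# assumes the endpoint config shares the endpoint's name
import boto3

sm = boto3.client("sagemaker")
sm.delete_endpoint(EndpointName=ENDPOINT_NAME)
sm.delete_endpoint_config(EndpointConfigName=ENDPOINT_NAME)
```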
Conclusion
By taking advantage of the features of Jina Embeddings v2 to develop RAG applications, together with the streamlined access to state-of-the-art models on SageMaker JumpStart, developers and businesses are now empowered to create sophisticated AI solutions with ease.
Jina Embeddings v2’s extended context length, support for bilingual documents, and small model size enable enterprises to quickly build natural language processing use cases on their internal datasets without relying on external APIs.
Get started with SageMaker JumpStart today, and refer to the GitHub repository for the complete code to run this sample.
Connect with Jina AI
Jina AI remains committed to leadership in bringing affordable and accessible AI embeddings technology to the world.