In this article, we’ll explore the use of prompt compression techniques in the early stages of development, which can help reduce the ongoing operating costs of GenAI-based applications. Often, generative AI applications utilize the retrieval-augmented generation framework, alongside prompt engineering, to extract the best output from the underlying large language models. However, this approach may not be cost-effective in the long run, as operating costs can significantly increase when your application scales in production and relies on model providers like OpenAI or Google Gemini, among others. The prompt compression techniques we’ll explore below can significantly lower operating costs.
## Challenges Faced While Building the RAG-Based GenAI App
RAG (retrieval-augmented generation) is a popular framework for building GenAI applications powered by a vector database, where semantically relevant data is added to the large language model’s context window to ground the generated content. While building our GenAI application, we ran into an unexpected problem of rising costs once the app went into production and end users started using it. A thorough inspection revealed that the main driver was the amount of data we had to send to OpenAI for each user interaction: the more information or context we provided so the large language model could understand the conversation, the higher the expense.

The problem was most visible in our Q&A chat feature, which we integrated with OpenAI. To keep the conversation flowing naturally, we had to include the entire chat history in every new query. A large language model has no memory of its own, so if we didn’t resend all the previous conversation details, it couldn’t make sense of new questions that built on past discussion. As users kept chatting, every message sent with the full history drove up our costs significantly. Though the application was successful and delivered a great user experience, it failed to keep operating costs low enough.

A similar pattern appears in applications that generate personalized content from user inputs. Suppose a fitness app uses GenAI to create custom workout plans. If the app needs to consider a user’s entire exercise history, preferences, and feedback each time it suggests a new workout, the input size becomes quite large, and that larger input means higher processing costs. Another scenario is a recipe recommendation engine: if it factors in a user’s dietary restrictions, past likes and dislikes, and nutritional goals with each recommendation, the amount of information sent for processing grows, and, as with the chat application, so do the operational costs.

In each of these examples, the key challenge is providing enough context for the LLM to be useful and personalized without letting costs spiral out of control from the volume of data processed in every interaction.
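To make the cost driver concrete, here is a minimal sketch of the pattern described above, where every new question resends the full conversation history. It assumes the OpenAI Python client; the model name is illustrative, not necessarily the one we used.

```python
# Minimal sketch: each turn resends the full history, so input tokens grow per turn.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    # The entire conversation so far goes into every request, because the model
    # has no memory of previous calls; input size (and cost) grows with each turn.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```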
## How We Solved the Rising Cost of the RAG Pipeline
Facing the challenge of rising operational costs in our GenAI applications, we zeroed in on optimizing how we communicate with the AI models through prompt engineering. Prompt engineering is a crucial technique that involves crafting queries or instructions to the underlying LLM so that we get the most precise and relevant responses. The goal is to improve the quality of the model’s output while reducing the operational expense of producing it: asking the right questions in the right way so the LLM can perform efficiently and cost-effectively.

While exploring ways to mitigate these costs, we discovered the efficacy of prompt compression. This technique streamlines communication with the model by distilling prompts down to their most essential elements and stripping away unnecessary information. It not only reduces the computational burden on the GenAI system but also significantly lowers the cost of deploying GenAI solutions, particularly those built on retrieval-augmented generation.

By applying prompt compression, we achieved considerable savings in the operational costs of our GenAI projects, making it feasible to use these technologies across a broader range of business applications without the financial strain previously associated with them. Refining our prompt engineering practices underscored the importance of efficiency in GenAI interactions: strategic simplification leads to more accessible and economically viable GenAI solutions for businesses.

We used prompt compression tools not only to reduce operating costs but also to revamp the prompts we sent to the LLM. With the tooling alone, we noticed almost 51% savings in cost. But when we followed GPT’s own prompt compression technique, rewriting prompts ourselves or asking GPT to suggest shorter versions, we found almost a 70-75% cost reduction. We used OpenAI’s tokenizer tool to experiment with the prompts and see how far we could shorten them while still getting the same output from OpenAI; the tokenizer shows exactly how many tokens a prompt will consume in the model’s context window.
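As a rough illustration of the “ask GPT to shorten the prompt” approach mentioned above, the sketch below sends a prompt to the model and asks for a compressed rewrite. The model name and the compression instruction are assumptions for illustration, not the exact ones we used.

```python
# A minimal sketch of using the model itself to compress a prompt.
from openai import OpenAI

client = OpenAI()

def compress_prompt(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's prompt as briefly as possible while "
                    "preserving every instruction and constraint. "
                    "Return only the rewritten prompt."
                ),
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # keep the rewrite deterministic
    )
    return response.choices[0].message.content.strip()
```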
## Prompt Examples
Let’s look at some examples of these prompts.
- **Trip to Italy**
  - Original prompt: I am currently planning a trip to Italy and I want to make sure that I visit all the must-see historical sites as well as enjoy some local cuisine. Could you provide me with a list of top historical sites in Italy and some traditional dishes I should try while I am there?
  - Compressed prompt: Italy trip: List top historical sites and traditional dishes to try.
- **Healthy recipe**
  - Original prompt: I am looking for a healthy recipe that I can make for dinner tonight. It needs to be vegetarian, include ingredients like tomatoes, spinach, and chickpeas, and it should be something that can be made in less than an hour. Do you have any suggestions?
  - Compressed prompt: Need a quick, healthy vegetarian recipe with tomatoes, spinach, and chickpeas. Suggestions?
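To measure how much a compressed prompt actually saves, you can count tokens before and after. The sketch below uses the open-source tiktoken library rather than the web-based tokenizer tool mentioned earlier; the encoding name is an assumption that matches several OpenAI chat models.

```python
# Count tokens for the original and compressed "Trip to Italy" prompts.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding for illustration

original = (
    "I am currently planning a trip to Italy and I want to make sure that I visit "
    "all the must-see historical sites as well as enjoy some local cuisine. Could you "
    "provide me with a list of top historical sites in Italy and some traditional "
    "dishes I should try while I am there?"
)
compressed = "Italy trip: List top historical sites and traditional dishes to try."

original_tokens = len(encoding.encode(original))
compressed_tokens = len(encoding.encode(compressed))

print(f"Original:   {original_tokens} tokens")
print(f"Compressed: {compressed_tokens} tokens")
print(f"Reduction:  {1 - compressed_tokens / original_tokens:.0%}")
```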
## Understanding Prompt Compression
Crafting effective prompts is crucial when using large language models in real-world enterprise applications. Strategies like providing step-by-step reasoning, incorporating relevant examples, and including supplementary documents or conversation history play a vital role in improving model performance on specialized NLP tasks. However, these techniques often produce longer prompts, with inputs that can span thousands of tokens, which enlarges the input context window. That increase in prompt length can significantly drive up the costs of using advanced models, particularly expensive LLMs like GPT-4, which is why prompt engineering must be combined with other techniques that balance comprehensive context against computational expense.

Prompt compression is a technique for optimizing the prompts and input context we send to large language models. When we provide an LLM with a prompt or query along with any relevant contextual content, it processes the entire input, which can be computationally expensive, especially for long prompts carrying lots of data. Prompt compression reduces the size of the input by condensing the prompt to its most essential components and removing unnecessary or redundant information, so the input stays within limits.

The process typically involves analyzing the prompt and identifying the key elements that the LLM needs in order to understand the context and generate a relevant response. These key elements could be specific keywords, entities, or phrases that capture the core meaning of the prompt. The compressed prompt is then built by retaining these essential components and discarding the rest (a minimal code sketch of this idea follows the list of benefits below).

Implementing prompt compression in the RAG pipeline has several benefits:
- **Reduced computational load:** By compressing the prompts, the LLM needs to process less input data, resulting in a reduced computational load. This can lead to faster response times and lower computational costs.
- **Improved cost-effectiveness:** Most LLM providers charge based on the number of tokens (words or subwords) passed in the input context window and processed. Compressed prompts use far fewer tokens, leading to significantly lower costs for each query or interaction with the LLM.
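As promised above, here is a minimal, rule-based sketch of the core idea: keep the words that carry meaning and drop common filler. Real prompt-compression tooling is considerably more sophisticated; the filler-word list here is a small illustrative sample, not a production stopword list.

```python
# Crude prompt compression: drop filler words, keep keywords and entities.
FILLER_WORDS = {
    "i", "am", "is", "are", "the", "a", "an", "and", "that", "to", "of",
    "could", "you", "me", "with", "some", "it", "should", "be", "do",
    "have", "any", "make", "like",
}

def compress(prompt: str) -> str:
    words = prompt.replace(",", " ").replace(".", " ").split()
    kept = [word for word in words if word.lower() not in FILLER_WORDS]
    return " ".join(kept)

print(compress(
    "I am looking for a healthy recipe that I can make for dinner tonight. "
    "It needs to be vegetarian, include ingredients like tomatoes, spinach, and chickpeas."
))
# -> "looking for healthy recipe can for dinner tonight needs vegetarian include ingredients tomatoes spinach chickpeas"
```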
By effectively implementing prompt compression techniques, businesses can not only enhance the performance of their GenAI applications but also keep their operational costs in check, allowing for wider adoption of these advanced technologies.