A Deep Dive into In-Context Learning | by Aris Tsakpinis

Stepping out of the “comfort zone” — part 2/3 of a deep-dive into domain adaptation approaches for LLMs

Photo by StableDiffusionXL on Amazon Web Services

Exploring domain adapting large language models (LLMs) to your specific domain or use case? This 3-part blog post series explains the motivation for domain adaptation and dives deep into various options to do so. Further, a detailed guide for mastering the entire domain adaptation journey covering popular tradeoffs is being provided.

Part 1: Introduction into domain adaptation — motivation, options, tradeoffs Part 2: A deep dive into in-context learning — You’re here!Part 3: A deep dive into fine-tuning

Note: All images, unless otherwise noted, are by the author.

In the first part of this blog post series, we discussed the rapid advancements in generative AI and the emergence of large language models (LLMs) like Claude, GPT-4, Meta LLaMA, and Stable Diffusion. These models have demonstrated remarkable capabilities in content creation, sparking both enthusiasm and concerns about potential risks. We highlighted that while these AI models are powerful, they also have inherent limitations and “comfort zones” — areas where they excel, and areas where their performance can degrade when pushed outside their expertise. This can lead to model responses that fall below the expected quality, potentially resulting in hallucinations, biased outputs, or other undesirable behaviors.

To address these challenges and enable the strategic use of generative AI in enterprises, we introduced three key design principles: Helpfulness, Honesty, and Harmlessness. We also discussed how domain adaptation techniques, such as in-context learning and fine-tuning, can be leveraged to overcome the “comfort zone” limitations of these models and create enterprise-grade, compliant generative AI-powered applications. In this second part, we will dive deeper into the world of in-context learning, exploring how these techniques can be used to transform tasks and move them back into the models’ comfort zones.

In-context learning aims to make use of external tooling to modify the task to be solved in a way that moves it back (or closer) into a model’s comfort zone. In the world of LLMs, this can be done through prompt engineering, which involves infusing source knowledge through the model prompt to transform the overall complexity of a task. It can be executed in a rather static manner (e.g. few-shot prompting), but more sophisticated, dynamic prompt engineering techniques like retrieval-augmented generation (RAG) or Agents have proven to be powerful.

Figure 1: in-context learning to overcome hallucinations — Source: Claude 3 Sonnet via Amazon Bedrock

In part 1 of this blog post series we noticed alongside the example depicted in figure 1 how adding a static context like a speaker bio can help reduce the complexity of the task to be solved by the model, leading to better model results. In what follows, we will dive deeper into more advanced concepts of in-context learning.

“The measure of intelligence is the ability to change.” (Albert Einstein)

While the above example with static context infusion works well for static use cases, it lacks the ability to scale across diverse and complex domains. Assuming the scope of our closed QA task would not be limited to me as a person only, but to all speakers of a huge conference and hence hundreds of speaker bios. In this case, manual identification and insertion of the relevant piece of context (i.e. the speaker bio) becomes cumbersome, error-prone, and impractical. In theory, recent models come with huge context sizes up to 200k tokens or more, fitting not only those hundreds of speaker bios, but entire books and knowledge bases. However, there is plenty of reasons why this is not a desirable approach, like cost in a pay per token approach, compute requirements, latency, etc. .

Luckily, plenty of optimized content retrieval approaches concerned with identifying exactly the piece of context most suitable to ingest in a dynamic approach exist — some of a deterministic nature (e.g. SQL-queries on structured data), others powered by probabilistic systems (e.g. semantic search). Chaining these two components together into an integrated closed Q&A approach with dynamic context retrieval and infusion has proven to be extremely powerful. Thereby, a huge (endless?) variety of data sources — from relational or graph databases over vector stores to enterprise systems or real-time APIs — can be connected. To accomplish this, the identified context piece(s) of highest relevance is (are) extracted and dynamically ingested into the prompt template used against the generative decoder model when accomplishing the desired task. Figure 2 shows this exemplarily for a user-facing Q&A application (e.g., a chatbot).

Figure 2: dynamic context infusion with various data sources

The by far most popular approach to dynamic prompt engineering is RAG. The approach works well when trying to ingest context originating from large full-text knowledge bases dynamically. It combines two probabilistic methods by augmenting an open Q&A task with dynamic context retrieved by semantic search, turning an open Q&A task into a closed one.

Figure 3: retrieval-augmented generation (RAG) on AWS

First, the documents are being sliced into chunks of digestible size. Then, an encoder LLM is used for creating contextualised embeddings of these snippets, encoding the semantics of every chunk into the mathematical space in the form of a vector. This information is stored in a vector database, which acts as our knowledge base. Thereby, the vector is used as the primary key, whereas the text itself, together with optional metadata, is stored alongside.

(0) In case of a user question, the input submitted is cleaned and encoded by the very same embeddings model, creating a semantic representation of the user’s question in the knowledge base’s vector space.

(1) This embedding is subsequently used for carrying out a similarity search based on vector distance metrics over the entire knowledge base — with the hypothesis that the k snippets with the highest similarity to the user’s question in the vector space are likely best suited for grounding the question with context.

(2) In the next step, these top k snippets are passed to a decoder generative LLM as context alongside the user’s initial question, forming a closed Q&A task.

(3) The LLM answers the question in a grounded way in the style instructed by the application’s system prompt (e.g., chatbot style).

Knowledge Graph-Augmented Generation (KGAG) is another dynamic prompting approach that integrates structured knowledge graphs to transform the task to be solved and hence enhance the factual accuracy and informativeness of language model outputs. Integrating knowledge graphs can be achieved by several approaches.

Figure 4: Knowledge Graph augmented generation (KGAG) — Source: Kang et al (2023)

As one of those, the KGAG framework proposed by Kang et al (2023) consists of three key components:

(1) The context-relevant subgraph retriever retrieves a relevant subgraph Z from the overall knowledge graph G given the current dialogue history x. To do this, the model defines a retrieval score for each individual triplet z = (eh, r, et) in the knowledge graph, computed as the inner product between embeddings of the dialogue history x and the candidate triplet z. The triplet embeddings are generated using Graph Neural Networks (GNNs) to capture the relational structure of the knowledge graph. The retrieval distribution p(Z|x) is then computed as the product of the individual triplet retrieval scores p(z|x), allowing the model to retrieve only the most relevant subgraph Z for the given dialogue context.

(2) The model needs to encode the retrieved subgraph Z along with the text sequence x for the language model. A naive approach would be to simply prepend the tokens of entities and relations in Z to the input x, but this violates important properties like permutation invariance and relation inversion invariance. To address this, the paper proposes an “invariant and efficient” graph encoding method. It first sorts the unique entities in Z and encodes them, then applies a learned affine transformation to perturb the entity embeddings based on the graph structure. This satisfies the desired invariance properties while also being more computationally efficient than prepending all triplet tokens.

(3) The model uses a contrastive learning objective to ensure the generated text is consistent with the retrieved subgraph Z. Specifically, it maximizes the similarity between the representations of the retrieved subgraph and the generated text, while minimizing the similarity to negative samples. This encourages the model to generate responses that faithfully reflect the factual knowledge contained in the retrieved subgraph.

By combining these three components — subgraph retrieval, invariant graph encoding, and graph-text contrastive learning — the KGAG framework can generate knowledge-grounded responses that are both fluent and factually accurate.

KGAG is particularly useful in dialogue systems, question answering, and other applications where generating informative and factually accurate responses is important. It can be applied in domains where there is access to a relevant knowledge graph, such as encyclopaedic knowledge, product information, or domain-specific facts. By combining the strengths of language models and structured knowledge, KGAG can produce responses that are both natural and trustworthy, making it a valuable tool for building intelligent conversational agents and knowledge-intensive applications.

Chain-of-Thought is a prompt engineering approach introduced by Wei et al in 2023. By providing the model with either instructions or few-shot examples of structured reasoning steps towards a problem solution, it reduces the complexity of the problem to be solved by the model significantly.