Stepping out of the “comfort zone” — part 3/3 of a deep-dive into domain adaptation approaches for LLMs
Are you exploring how to adapt large language models (LLMs) to your specific domain or use case? This 3-part blog post series explains the motivation for domain adaptation and dives deep into the various options for doing so. It also provides a detailed guide for mastering the entire domain adaptation journey, covering the most common tradeoffs.
Part 1: Introduction to domain adaptation — motivation, options, tradeoffs
Part 2: A deep dive into in-context learning
Part 3: A deep dive into fine-tuning — You’re here!
Note: All images, unless otherwise noted, are by the author.
In the previous part of this blog post series, we explored the concept of in-context learning as a powerful approach to overcome the “comfort zone” limitations of large language models (LLMs). We discussed how these techniques can be used to transform tasks and move them back into the models’ areas of expertise, leading to improved performance and alignment with the key design principles of Helpfulness, Honesty, and Harmlessness. In this third part, we will shift our focus to the second domain adaptation approach: fine-tuning. We will dive into the details of fine-tuning, exploring how it can be leveraged to expand the models’ “comfort zones” and hence uplift performance by adapting them to specific domains and tasks. We will discuss the trade-offs between prompt engineering and fine-tuning, and provide guidance on when to choose one approach over the other based on factors such as data velocity, task ambiguity, and other considerations.
Most state-of-the-art LLMs are powered by the transformer architecture, a family of deep neural network architectures which has disrupted the field of NLP since being proposed by Vaswani et al. in 2017, breaking all common benchmarks across the domain. The core differentiator of this architecture family is a concept called “attention”, which excels in capturing the semantic meaning of words or larger pieces of natural language based on the context they are used in.
The transformer architecture consists of two fundamentally different building blocks. On the one side, the “encoder” block focuses on translating the semantics of natural language into so-called contextualized embeddings, which are mathematical representations in the vector space. This makes encoder models particularly useful in use cases utilizing these vector representations for downstream deterministic or probabilistic tasks like classification problems, NER, or semantic search. On the other side, the decoder block is trained on next-token prediction and hence capable of generatively producing text if used in a recursive manner. Decoder models can be used for all tasks relying on the generation of text. These building blocks can be used independently of each other, but also in combination. Most of the models referred to within the field of generative AI today are decoder-only models, which is why this blog post focuses on this type of model.
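To make this distinction more concrete, here is a minimal sketch contrasting the two building blocks with the Hugging Face transformers library. The checkpoints used (bert-base-uncased as an encoder, gpt2 as a decoder-only model) are illustrative stand-ins, not recommendations.

```python
# Minimal sketch contrasting encoder and decoder-only transformers.
# Checkpoints (bert-base-uncased, gpt2) are illustrative stand-ins.
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

text = "Berlin is the capital of Germany."

# Encoder: turn text into contextualized embeddings (e.g. for search or classification)
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = enc_tok(text, return_tensors="pt")
embeddings = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)

# Decoder-only: predict the next token recursively to generate text
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = dec_tok("Berlin is the capital of", return_tensors="pt")
generated = decoder.generate(**prompt, max_new_tokens=5)
print(dec_tok.decode(generated[0], skip_special_tokens=True))
```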
Fine-tuning leverages transfer learning to efficiently inject niche expertise into a foundation model like LLaMA2. The process involves updating the model’s weights through training on domain-specific data, while keeping the overall network architecture unchanged. Unlike full pre-training which requires massive datasets and compute, fine-tuning is highly sample and compute efficient. On a high level, the end-to-end process can be broken down into the following phases:
Data collection and selection: The set of proprietary data to be ingested into the model needs to be carefully selected. On top of that, for specific fine-tuning purposes, data might not be available yet and has to be purposely collected. Depending on the data available and the task to be achieved through fine-tuning, data of different quantitative or qualitative characteristics might be selected (e.g. labeled, unlabeled, or preference data — see below). Besides the data quality aspect, dimensions like data source, confidentiality and IP, licensing, copyright, PII, and more need to be considered.
While LLM pre-training usually leverages a mix of web scrapes and curated corpora, the nature of fine-tuning as a domain adaptation approach implies that the datasets used are mostly curated corpora of labeled or unlabeled data specific to an organizational, knowledge, or task-specific domain.
While this data can be sourced differently (document repositories, human-created content, etc.), this underlines that for fine-tuning, it is important to carefully select the data with respect to quality, but as mentioned above, also consider topics like confidentiality and IP, licensing, copyright, PII, and others.
In addition to this, an important dimension is the categorization of the training dataset into unlabeled and labeled (including preference) data. Domain adaptation fine-tuning requires unlabeled textual data (as opposed to other fine-tuning approaches, see figure 4). In other words, we can simply use any full-text documents in natural language that we consider to be of relevant content and sufficient quality. This could be user manuals, internal documentation, or even legal contracts, depending on the actual use case.
On the other hand, labeled datasets like an instruction-context-response dataset can be used for supervised fine-tuning approaches. Lately, reinforcement learning approaches for aligning models to actual user feedback have shown great results, leveraging human- or machine-created preference data, e.g., binary human feedback (thumbs up/down) or multi-response ranking.
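To illustrate the difference, a single supervised fine-tuning record and a single preference record might look like the following sketch. The field names and contents are hypothetical; the exact schema depends on the dataset and training framework used.

```python
# Hypothetical example records; field names vary by dataset and training framework.

# Supervised fine-tuning (instruction-context-response) sample
sft_sample = {
    "instruction": "Summarize the warranty terms.",
    "context": "The warranty covers manufacturing defects for 24 months ...",
    "response": "The product is covered against manufacturing defects for two years ...",
}

# Preference sample for alignment (e.g. from multi-response ranking or thumbs up/down)
preference_sample = {
    "prompt": "Explain the cancellation policy.",
    "chosen": "You can cancel free of charge up to 30 days before ...",
    "rejected": "Cancellation is not possible.",
}
```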
As opposed to unlabeled data, labeled datasets are more difficult and expensive to collect, especially at scale and with sufficient domain expertise. Open-source data hubs like HuggingFace Datasets can be good sources for labeled datasets, especially in areas where the broader part of a relevant human population group agrees (e.g., a toxicity dataset for red-teaming) and where using an open-source dataset as a proxy for the preferences of the model’s real users is sufficient.
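As a quick sketch of how such an open-source dataset could be pulled in: the dataset ID below is just one example of a publicly available preference dataset and is an assumption on my part, not a dataset prescribed by this post.

```python
# Sketch: loading an open-source labeled/preference dataset from the Hugging Face Hub.
# The dataset ID is an illustrative example; substitute one that fits your use case.
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf", split="train")
print(dataset[0])  # inspect a single preference record (chosen/rejected pair)
```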
Recently, synthetic data collection has become more and more a topic in the space of fine-tuning. This is the practice of using powerful LLMs to synthetically create labeled datasets, be it for SFT or preference alignment. Even though this approach has already shown promising results, it is currently still subject to further research and has to prove itself to be useful at scale in practice.
Data pre-processing: The selected data needs to be pre-processed to make it “well digestible” for the downstream training algorithm. Popular pre-processing steps are the following (illustrated in the sketch after this list):
- Quality-related pre-processing, e.g. formatting, deduplication, PII filtering
- Fine-tuning approach related pre-processing, e.g. rendering into prompt templates for supervised fine-tuning
- NLP-related pre-processing, e.g. tokenisation, embedding, chunking (according to the context window)
Model training: Training of the deep neural network according to the selected fine-tuning approach. Popular fine-tuning approaches we will discuss in detail further below are:
- Continued pre-training aka domain-adaptation fine-tuning: training on full-text data, with alignment tied to a next-token-prediction task
- Supervised fine-tuning: fine-tuning approach leveraging labeled data, with alignment tied towards the target label
- Preference-alignment approaches: fine-tuning approaches leveraging preference data, aligning to a desired behaviour defined by the actual users of a model / system
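As a minimal illustration of the last two pre-processing steps, the sketch below renders a labeled sample into a prompt template and then tokenizes and chunks text to fit a context window. The template, checkpoint, and chunk size are assumptions made for the sake of the example, not a prescribed format.

```python
# Sketch of two common pre-processing steps; template, checkpoint, and chunk size are illustrative.
from transformers import AutoTokenizer

# 1) Render an instruction-context-response sample into a prompt template (for SFT)
template = "### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n{response}"
rendered = template.format(
    instruction="Summarize the warranty terms.",
    context="The warranty covers manufacturing defects for 24 months ...",
    response="The product is covered against manufacturing defects for two years ...",
)

# 2) Tokenize full-text data and chunk it to fit the model's context window
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint
max_len = 512  # assumed token budget per training sample
token_ids = tokenizer(rendered)["input_ids"]
chunks = [token_ids[i : i + max_len] for i in range(0, len(token_ids), max_len)]
```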
Subsequently, we will dive deeper into the individual phases, starting with an introduction to the training approach and the different fine-tuning approaches before we move on to the dataset and data processing requirements.
In this section we will explore the approach for training decoder transformer models. This applies to pre-training as well as fine-tuning. As opposed to traditional ML training approaches like unsupervised learning with unlabeled data or supervised learning with labeled data, the training of transformer models utilizes a hybrid approach referred to as self-supervised learning. This is because, although the model is fed with unlabeled textual data, the algorithm intrinsically supervises itself by masking specific input tokens. Given the input sequence of tokens “Berlin is the capital of Germany.”, this natively leads to a supervised sample with y being the masked token and X being the rest.
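A minimal sketch of how such self-supervised (X, y) pairs arise for a decoder-only model via next-token prediction is shown below. Splitting on whitespace is a simplification; real models use subword tokenizers.

```python
# Sketch: deriving (X, y) training pairs from unlabeled text via next-token prediction.
# Word-level splitting is a simplification of real subword tokenization.
tokens = "Berlin is the capital of Germany .".split()

samples = []
for i in range(1, len(tokens)):
    X = tokens[:i]   # context seen so far
    y = tokens[i]    # the "masked" next token the model must predict
    samples.append((X, y))

for X, y in samples:
    print(f"X = {X} -> y = {y}")
# e.g. X = ['Berlin', 'is', 'the', 'capital', 'of'] -> y = 'Germany'
```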