What We Learned from a Year of Building with LLMs (Part III): Strategy

We previously shared our insights on the tactics we have honed while operating LLM applications. Tactics are granular: they are the specific actions employed to achieve specific objectives. We also shared our perspective on operations: the higher-level processes in place to support tactical work to achieve objectives.

But where do those objectives come from? That is the domain of strategy. Strategy answers the “what” and “why” questions behind the “how” of tactics and operations.


We provide our opinionated takes, such as “no GPUs before PMF” and “focus on the system not the model”, to help teams figure out where to allocate scarce resources. We also suggest a roadmap for iterating towards a great product. This final set of lessons answers the following questions:

Building vs. Buying: When should you train your own models, and when should you leverage existing APIs? The answer is, as always, “it depends”. We share what it depends on.

Iterating to Something Great: How can you create a lasting competitive edge that goes beyond just using the latest models? We discuss the importance of building a robust system around the model and focusing on delivering memorable, sticky experiences.

Human-Centered AI: How can you effectively integrate LLMs into human workflows to maximize productivity and happiness? We emphasize the importance of building AI tools that support and enhance human capabilities rather than attempting to replace them entirely.

Getting Started: What are the essential steps for teams embarking on building an LLM product? We outline a basic playbook that starts with prompt engineering, evaluations, and data collection.

The Future of Low-Cost Cognition: How will the rapidly decreasing costs and increasing capabilities of LLMs shape the future of AI applications? We examine historical trends and walk through a simple method to estimate when certain applications might become economically feasible.

From Demos to Products: What does it take to go from a compelling demo to a reliable, scalable product? We emphasize the need for rigorous engineering, testing, and refinement to bridge the gap between prototype and production.

To answer these difficult questions, let’s think step by step…

Strategy: Building with LLMs without Getting Out-Maneuvered

Successful products require thoughtful planning and tough prioritization, not endless prototyping or following the latest model releases or trends. In this final section, we look around the corners and think about the strategic considerations for building great AI products. We also examine key trade-offs teams will face, like when to build and when to buy, and suggest a “playbook” for early LLM application development strategy.

No GPUs before PMF

To be great, your product needs to be more than just a thin wrapper around somebody else’s API. But mistakes in the opposite direction can be even more costly. The past year has also seen a mint of venture capital, including an eye-watering six billion dollar Series A, spent on training and customizing models without a clear product vision or target market. In this section, we’ll explain why jumping immediately to training your own models is a mistake and consider the role of self-hosting.

Training from scratch (almost) never makes sense

For most organizations, pre-training an LLM from scratch is an impractical distraction from building products.

As exciting as it is and as much as it seems like everyone else is doing it, developing and maintaining machine learning infrastructure takes a lot of resources. This includes gathering data, training and evaluating models, and deploying them. If you’re still validating product-market fit, these efforts will divert resources from developing your core product. Even if you had the compute, data, and technical chops, the pretrained LLM may become obsolete in months.

Consider the case of BloombergGPT, an LLM specifically trained for financial tasks. The model was pretrained on 363B tokens and required a heroic effort by nine full-time employees, four from AI Engineering and five from ML Product and Research. Despite this effort, it was outclassed by gpt-3.5-turbo and gpt-4 on those financial tasks within a year.

This story and others like it suggest that for most practical applications, pretraining an LLM from scratch, even on domain-specific data, is not the best use of resources. Instead, teams are better off fine-tuning the strongest open-source models available for their specific needs.

Don’t fine-tune until you’ve proven it’s necessary

For most organizations, fine-tuning is driven more by FOMO than by clear strategic thinking.

Organizations invest in fine-tuning too early, trying to beat the “just another wrapper” allegations. In reality, fine-tuning is heavy machinery, to be deployed only after you’ve collected plenty of examples that convince you other approaches won’t suffice.

A year ago, many teams were telling us they were excited to fine-tune. Few have found product-market fit and most regret their decision. If you’re going to fine-tune, you’d better be really confident that you’re set up to do it again and again as base models improve; see “The model isn’t the product” and “Build LLMOps” below.

When might fine-tuning actually be the right call? If the use-case requires data not available in the mostly-open web-scale datasets used to train existing models—and if you’ve already built an MVP that demonstrates the existing models are insufficient. But be careful: if great training data isn’t readily available to the model builders, where are you getting it?

Ultimately, remember that LLM-powered applications aren’t a science fair project; investment in them should be commensurate with their contribution to your business’s strategic objectives and its competitive differentiation.

Start with inference APIs, but don’t be afraid of self-hosting

With LLM APIs, it’s easier than ever for startups to adopt and integrate language modeling capabilities without training their own models from scratch. Providers like Anthropic and OpenAI offer general APIs that can sprinkle intelligence into your product with just a few lines of code. By using these services, you can reduce the effort spent and instead focus on creating value for your customers, which lets you validate ideas and iterate towards product-market fit faster.

But, as with databases, managed services aren’t the right fit for every use case, especially as scale and requirements increase. Indeed, self-hosting may be the only way to use models without sending confidential/private data out of your network, as required in regulated industries like healthcare and finance, or by contractual obligations or confidentiality requirements.

Furthermore, self-hosting circumvents limitations imposed by inference providers, like rate limits, model deprecations, and usage restrictions. In addition, self-hosting gives you complete control over the model, making it easier to construct a differentiated, high quality system around it. Finally, self-hosting, especially of finetunes, can reduce cost at large scale. For example, Buzzfeed shared how they finetuned open-source LLMs to reduce costs by 80%.
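To make the "cost at large scale" point concrete, here is a rough break-even sketch. It assumes a flat per-token API price and a fixed monthly self-hosting bill. Both are simplifications, and the function names and parameters are hypothetical. It deliberately ignores engineering time, GPU utilization, and quality differences between models, all of which can dominate in practice.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Cost of serving a month's traffic through a pay-per-token API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_tokens(self_host_monthly_cost: float, price_per_million_tokens: float) -> float:
    """Monthly token volume above which a fixed self-hosting bill is cheaper.

    Assumes the self-hosted deployment can absorb that volume without
    additional hardware, which is an assumption you should verify.
    """
    return self_host_monthly_cost / price_per_million_tokens * 1_000_000
```

Plug in your own quotes: if the volume you actually expect is far below the break-even figure, the API is probably still the cheaper option.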

Iterate to something great

To sustain a competitive edge in the long run, you need to think beyond models and consider what will set your product apart. While speed of execution matters, it shouldn’t be your only advantage.

The model isn’t the product, the system around it is

For teams that aren’t building models, the rapid pace of innovation is a boon as they migrate from one SOTA model to the next, chasing gains in context size, reasoning capability, and price-to-value to build better and better products.

This progress is as exciting as it is predictable. Taken together, this means models are likely to be the least durable component in the system.

Instead, focus your efforts on what’s going to provide lasting value, such as:

Evaluation chassis: To reliably measure performance on your task across models

Guardrails: To prevent undesired outputs no matter the model

Caching: To reduce latency and cost by avoiding the model altogether

Data flywheel: To power the iterative improvement of everything above

These components create a thicker moat of product quality than raw model capabilities.
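As a minimal sketch of what "the system around the model" can look like, the snippet below wraps an interchangeable model function with a cache, a simple guardrail, a log that can seed a data flywheel, and a tiny evaluation helper. Every name here (`LLMSystem`, `ModelFn`, `evaluate`, the banned-terms guardrail) is hypothetical and chosen for illustration. Real guardrails and evals will be far richer.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt to a completion.
ModelFn = Callable[[str], str]


class LLMSystem:
    """Wraps an interchangeable model with caching, guardrails, and logging."""

    def __init__(self, model: ModelFn, banned_terms: List[str]):
        self.model = model
        self.banned_terms = [t.lower() for t in banned_terms]
        self.cache: Dict[str, str] = {}
        self.log: List[Dict[str, str]] = []  # raw material for a data flywheel

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def _passes_guardrails(self, output: str) -> bool:
        lowered = output.lower()
        return not any(term in lowered for term in self.banned_terms)

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.cache:  # caching: skip the model entirely
            return self.cache[key]
        output = self.model(prompt)
        if not self._passes_guardrails(output):  # guardrail: same for any model
            output = "Sorry, I can't help with that."
        self.cache[key] = output
        self.log.append({"prompt": prompt, "output": output})
        return output


def evaluate(system: LLMSystem, cases: List[Dict[str, str]]) -> float:
    """Evaluation chassis: fraction of cases whose output contains the expected text."""
    hits = sum(case["expected"] in system.complete(case["prompt"]) for case in cases)
    return hits / len(cases)
```

Because the model is just a function passed to the constructor, swapping in a newer model leaves the cache, guardrails, log, and eval cases untouched, which is the sense in which they outlast any single model.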

But that doesn’t mean building at the application layer is risk-free. Don’t point your shears at the same yaks that OpenAI or other model providers will need to shave if they want to provide viable enterprise software.

For example, some teams invested in building custom tooling to validate structured output from proprietary models; minimal investment here is important, but a deep one is not a good use of time. OpenAI needs to ensure that when you ask for a function call, you get a valid function call, because all of their customers want this. Employ some “strategic procrastination” here: build what you absolutely need, and await the obvious expansions to capabilities from providers.
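A "minimal investment" in structured-output validation might be no more than the following sketch, which uses only the standard library. The argument schema (`city`, `days`) and the function name are hypothetical. The point is that a few lines of checking can be enough while providers build the robust version.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical schema for illustration: required argument name -> expected type.
REQUIRED_ARGS = {"city": str, "days": int}


def parse_function_call(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed arguments, or None if the model output is unusable."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(args, dict):
        return None
    for name, expected_type in REQUIRED_ARGS.items():
        value = args.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return None
    return args
```

A caller can retry or fall back when this returns `None`. Anything more elaborate is a candidate for strategic procrastination.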

Build trust by starting small

Building a product that tries to be everything to everyone is a recipe for mediocrity. To create compelling products, companies need to specialize in building memorable, sticky experiences that keep users coming back.

Consider a generic RAG system that aims to answer any question a user might ask. The lack of specialization means that the system can’t prioritize recent information, parse domain-specific formats, or understand the nuances of specific tasks. As a result, users are left with a shallow, unreliable experience that doesn’t meet their needs.

To address this, focus on specific domains and use cases. Narrow the scope by going deep rather than wide. This will create domain-specific tools that resonate with users. Specialization also allows you to be upfront about your system’s capabilities and limitations. Being transparent about what your system can and cannot do demonstrates self-awareness, helps users understand where it can add the most value, and thus builds trust and confidence in the output.

Build LLMOps, but build it for the right reason: faster iteration

DevOps is not fundamentally about reproducible workflows or shifting left or empowering two pizza teams—and it’s definitely not about writing YAML files.

DevOps is about shortening the feedback cycles between work and its outcomes so that improvements accumulate instead of errors. Its roots go back, via the Lean Startup movement, to Lean manufacturing and the Toyota Production System, with its emphasis on Single Minute Exchange of Die and Kaizen.

MLOps has adapted the form of DevOps to ML. We have reproducible experiments and we have all-in-one suites that empower model builders to ship. And Lordy, do we have YAML files.

But as an industry, MLOps didn’t adapt the function of DevOps. It didn’t shorten the feedback gap between models and their inferences and interactions in production.

Hearteningly, the field of LLMOps has shifted away from thinking about hobgoblins of little minds like prompt management and towards the hard problems that block iteration: production monitoring and continual improvement, linked by evaluation.

Already, we have interactive arenas for neutral, crowd-sourced evaluation of chat and coding models—an outer loop of collective, iterative improvement. Tools like LangSmith, Log10, LangFuse, W&B Weave, HoneyHive, and more promise to not only collect and collate data about system outcomes in production, but also to leverage them to improve those systems by integrating deeply with development. Embrace these tools or build your own.

Don’t build LLM features you can buy

Most successful businesses are not LLM businesses. Simultaneously, most businesses have opportunities to be improved by LLMs.

This pair of observations often misleads leaders into hastily retrofitting systems with LLMs at increased cost and decreased quality and releasing them as ersatz, vanity “AI” features, complete with the now-dreaded sparkle icon. There’s a better way: focus on LLM applications that truly align with your product goals and enhance your core operations.

Consider a few misguided ventures that waste your team’s time:

Building custom text-to-SQL capabilities for your business.

Building a chatbot to talk to your documentation.

Integrating your company’s knowledge base with your customer support chatbot.

While the above are the hellos-world of LLM applications, none of them make sense for virtually any product company to build themselves. These are general problems for many businesses with a large gap between promising demo and dependable component—the customary domain of software companies. Investing valuable R&D resources on general problems being tackled en masse by the current Y Combinator batch is a waste.

If this sounds like trite business advice, it’s because in the frothy excitement of the current hype wave, it’s easy to mistake anything “LLM” as cutting-edge, accretive differentiation, missing which applications are already old hat.

AI in the loop; humans at the center

Right now, LLM-powered applications are brittle. They require an incredible amount of safeguarding and defensive engineering, and they remain hard to predict. At the same time, when tightly scoped, these applications can be wildly useful. This means that LLMs make excellent tools to accelerate user workflows.

While it may be tempting to imagine LLM-based applications fully replacing a workflow, or standing in for a job function, today the most effective paradigm is a human-computer centaur (cf. Centaur chess). When capable humans are paired with LLM capabilities tuned for their rapid utilization, productivity and happiness doing tasks can be massively increased. One of the flagship applications of LLMs, GitHub Copilot, demonstrated the power of these workflows:

“Overall, developers told us they felt more confident because coding is easier, more error-free, more readable, more reusable, more concise, more maintainable, and more resilient with GitHub Copilot and GitHub Copilot Chat than when they’re coding without it.” – Mario Rodriguez, GitHub

Those who have worked in ML for a long time may jump to the idea of “human-in-the-loop”, but not so fast: HITL machine learning is a paradigm built on human experts ensuring that ML models behave as predicted. While related, what we are proposing here is more subtle. LLM-driven systems should not be the primary drivers of most workflows today; they should merely be a resource.

Centering humans and asking how an LLM can support their workflow leads to significantly different product and design decisions. Ultimately, it will drive you to build different products than competitors who try to rapidly offshore all responsibility to LLMs: better, more useful, and less risky products.

Start with prompting, evals, and data collection

The previous sections have delivered a firehose of techniques and advice. It’s a lot to take in. Let’s consider the minimum useful set of advice: if a team wants to build an LLM product, where should they begin?

Over the last year, we’ve seen enough examples to start becoming confident that successful LLM applications follow a consistent trajectory. We walk through this basic “getting started” playbook in this section. The core idea is to start simple and only add complexity as needed. A decent rule of thumb is that each level of sophistication typically requires at least an order of magnitude more effort than the one before it. With this in mind…

Prompt engineering comes first

Start with prompt engineering. Use all the techniques we discussed in the tactics section before. Chain-of-thought, n-shot examples, and structured input and output are almost always a good idea. Prototype with the most highly capable models before trying to squeeze performance out of weaker models.
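As an illustration, the three techniques named above can be combined in one small prompt builder. This is a sketch under assumptions: the tag names (`<example>`, `<input>`, `<output>`, `<thinking>`) and the function name are our own choices for the example, not a required format, and the resulting string would be sent to whichever model you are prototyping with.

```python
from typing import List, Tuple


def build_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a prompt with n-shot examples, a reasoning step, and a fixed output format."""
    lines = [task, ""]
    # n-shot examples, wrapped in tags so input and output are unambiguous
    for example_input, example_output in examples:
        lines.append("<example>")
        lines.append(f"<input>{example_input}</input>")
        lines.append(f"<output>{example_output}</output>")
        lines.append("</example>")
    lines.append("")
    lines.append(f"<input>{query}</input>")
    # chain-of-thought: ask for reasoning before the answer
    lines.append("First reason step by step inside <thinking> tags.")
    # structured output: the answer lands in a tag that is easy to parse
    lines.append("Then give the final answer inside <output> tags.")
    return "\n".join(lines)
```

Keeping the prompt in a function like this also makes it easy to version and to run against an eval set whenever the wording changes.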

Only if prompt engineering cannot achieve the desired level of performance should you consider fine-tuning. This will come up more often if there are non-functional requirements (e.g., data privacy, complete control, cost) that block the use of proprietary models and thus require you to self-host. Just make sure those same privacy requirements don’t block you from using user data for fine-tuning!

Build evals and kickstart a data flywheel

Even teams that are just getting started need evals. Otherwise, you won’t know whether your prompt engineering is sufficient or when your fine-tuned model is ready to replace the base model.

Effective evals are specific to your tasks and mirror the intended use cases. The first level of evals that we recommend is unit testing. These simple assertions detect known or hypothesized failure modes and help drive early design decisions. Also see other task-specific evals for classification, summarization, etc.
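A sketch of what unit-test-style evals can look like follows. The specific checks are hypothetical examples for a summarization feature. Yours should encode the failure modes you have actually observed or expect for your own task.

```python
import re
from typing import Callable, Dict, List

# Each check is a named assertion about a single model output (all hypothetical).
CHECKS: Dict[str, Callable[[str], bool]] = {
    "non_empty": lambda out: bool(out.strip()),
    "under_50_words": lambda out: len(out.split()) <= 50,
    "no_placeholder_text": lambda out: "lorem ipsum" not in out.lower(),
    "no_apology_preamble": lambda out: not re.match(r"\s*(sorry|i apologize)", out, re.I),
}


def run_checks(output: str) -> List[str]:
    """Return the names of the checks that the output fails."""
    return [name for name, check in CHECKS.items() if not check(output)]


def pass_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass every check."""
    return sum(not run_checks(out) for out in outputs) / len(outputs)
```

Tracking `pass_rate` over a fixed set of inputs gives a cheap regression signal each time the prompt or the model changes.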

While unit tests and model-based evaluations are useful, they don’t replace the need for human evaluation. Have people use your model/product and provide feedback. This serves the dual purpose of measuring real-world performance and defect rates while also collecting high-quality annotated data that can be used to finetune future models. This creates a positive feedback loop, or data flywheel, which compounds over time:

Human evaluation to assess model performance and/or find defects

Use the annotated data to finetune the model or update the prompt

For example, when auditing LLM-generated summaries for defects we might label each sentence with fine-grained feedback identifying factual inconsistency, irrelevance, or poor style. We can then use these factual inconsistency annotations to train a hallucination classifier or use the relevance annotations to train a reward model to score on relevance. As another example, LinkedIn shared their success with using model-based evaluators to estimate hallucinations, responsible AI violations, coherence, etc. in their write-up.
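The summary-auditing example can be sketched as a small data structure plus two helpers: one that measures defect rates, and one that turns the same annotations into training rows for a factual-inconsistency classifier. The label names, the `Annotation` class, and the function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical label set, mirroring the defect types named in the text.
LABELS = {"ok", "factual_inconsistency", "irrelevance", "poor_style"}


@dataclass
class Annotation:
    source: str    # the document that was summarized
    sentence: str  # one sentence of the generated summary
    label: str     # one of LABELS, assigned by a human reviewer


def defect_rates(annotations: List[Annotation]) -> Dict[str, float]:
    """Share of annotated sentences carrying each defect label."""
    total = len(annotations)
    defects = sorted(LABELS - {"ok"})
    return {d: sum(a.label == d for a in annotations) / total for d in defects}


def hallucination_examples(annotations: List[Annotation]) -> List[Tuple[str, str, int]]:
    """Binary (source, sentence, target) rows for a factual-inconsistency classifier."""
    return [
        (a.source, a.sentence, int(a.label == "factual_inconsistency"))
        for a in annotations
    ]
```

The same pass of human review thus feeds both sides of the flywheel: `defect_rates` measures how the system is doing today, and `hallucination_examples` produces data for improving it.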
