Developing an understanding of a variety of LLM benchmarks & scores, including an intuition of when they may be of value for your purpose
It seems that almost on a weekly basis, a new large language model (LLM) is launched to the public. With each announcement, the model’s provider will tout performance numbers that can sound pretty impressive. The challenge I’ve found is that a wide breadth of performance metrics gets referenced across these press releases. While a few show up more often than others, there unfortunately is no single “go to” metric (or even two). If you want to see a tangible example of this, check out the page for GPT-4’s performance. It references many different benchmarks and scores!
The first natural question one might have is, “Why can’t we simply agree to use a single metric?” In short, there is no clean way to assess LLM performance as a whole, so each performance metric seeks to provide a quantitative assessment of one focused domain. Additionally, many of these metrics have “sub-metrics” that calculate the score slightly differently from the original. When I started researching this blog post, my intention was to cover every single one of these benchmarks and scores, but I quickly discovered that doing so would mean covering over 50 different metrics!
Because assessing each individual metric isn’t exactly feasible, what I discovered is that we can chunk these various benchmarks and scores into categories based on what they are generally trying to assess. In the remainder of this post, we will cover these categories and provide specific examples of popular metrics that fall under each of them. The goal is that you walk away with a general sense of which performance metrics to consider for your specific use case.
The six categories we’ll assess in this post are listed below. Please note: there isn’t an “industry standard” for how these categories were created; they simply reflect how I hear these benchmarks referenced most often:
General knowledge benchmarks