Mar 19, 2023
TL;DR: FMs are fabs.
Today's historians of technology view us as being in the “deep learning era,” which started around 2012 and continues to this day. Future historians of technology will see it differently. The deep learning era ended in 2020, when foundation model capabilities took off. Today, we are in the foundation model era.
The foundation model era is characterized by different metrics, models, processes, and economics. The last of these is the eventual goal of this article, but before making claims about the macro we should start with the micro. Let's begin with metrics, which are key to understanding a field's priorities.
In the deep learning era, research advanced via increased performance on established benchmarks like ImageNet. Year after year, we pushed forward performance on known but unsolved tasks. We discovered and engineered thousands of techniques of varying importance, including both new architectures and optimizers as well as new training objectives, dataset augmentations, and problem formulations. Yet nobody (perhaps outside of some corners of the RL community) expected anything much more than increased performance, so performance was a pretty good way to talk about it. A better ImageNet classifier could be expected, on average, to produce better representations for transfer learning, but it wasn't going to spontaneously learn to drive a car.
The FM era is different. Performance remains a meaningful metric -- all the behaviors we observe seem to emerge from lower perplexities on natural language -- but it is no longer decisive, and nobody wants to use these models for simple next-word prediction anyways (except incidentally). The right way to talk about these models is instead in terms of their capabilities.
Early language models could string together a phrase or two. The first FMs could put together a few coherent sentences at a time, but not much more. Recent FMs are tremendously more capable, and can write pages of interesting, novel content on complex subjects in a wide variety of domains. Of course, there are many things they can't do -- complex theorem-proving remains unsolved by even the smartest models. The important point is that, unlike the deep learning models of prior years, these models cannot be reduced to a simple score. One can easily imagine two FMs with the same overall perplexity on natural language, where one had more of PubMed in its training data and the other had more of GitHub. We shouldn't be surprised that the first is a better doctor and the second a better programmer.
Thinking in terms of capabilities is important because it leads us intuitively to the notion that performance can saturate, at least in terms of the value it provides. ChatGPT is, actually, already great at copy-editing. Newer models could surely be better in a technical sense, and there are still gains to be had in customizing its corrections to how I write, but I doubt GPT-4 does particularly better in practice. For me, that is now a solved problem.
Of course, there will be an ever-expanding frontier of value as FMs continue to get smarter. GitHub Copilot is useful because it makes programmers appreciably more productive, but it is hard to imagine we'll run out of software to write. Building great software has a near-infinite capacity to absorb intelligence and still get appreciably better. We have a long way to go before we saturate demand for AI in software, even if the performance of a two-line autocomplete could saturate reasonably soon.
To summarize: if you want to understand the demand side of FMs, you need to break FMs into their constituent capabilities. For some tasks, demand for better models is unlikely to abate anytime soon. But for others, "good enough" is actually good enough, so there will be substantial (and increasing) demand for previous-generation models as their capabilities expand.
Cutting edge FMs are getting more and more expensive to train. GPT-4 is rumored to have 1T parameters, and I would wager it cost tens of millions of dollars to train in compute alone, ignoring both data generation and researcher compensation. And not only are costs increasing, but supply is limited. Only Google and NVIDIA have cost-effective access to enormous volumes of state-of-the-art, AI-optimized compute. Everyone else must pay, and even then there are only so many GPUs and TPUs right now. It is not hard to imagine that in a few years, constructing a state-of-the-art FM could be a billion-dollar endeavor.
At the same time, "last generation" capabilities are being commoditized increasingly quickly. While only OpenAI has a GPT-4-level model, there are now many high-quality ChatGPT competitors: Anthropic's Claude (which outdoes ChatGPT in many meaningful ways), Google's Bard, and Stanford's Alpaca, to name a few. ChatGPT is still better than these models in many ways. But the gap is closing very quickly, and with efforts underway to efficiently run these models on more commoditized hardware, we should expect prices of last-generation inference to drop very quickly.
This price drop will be highly significant. State-of-the-art models will necessarily be reasonably expensive to offset their tremendous training costs, which will limit them to only the highest-value tasks, like drafting legal documents and writing code. But the capabilities of last-generation models will become completely ubiquitous as prices fall. While today it is hard to imagine using FMs to sift through mountains of low-value unstructured data in search of insights, in the near future this sort of work will be the norm.
In other words, the economics of FMs will be just like those of semiconductor fabs. State-of-the-art models will be enormously expensive to construct, perhaps someday even rivaling the cost of building modern fabs. Like fabs, they may require national support to build. And as a result, the capabilities of these models, like those of cutting-edge chips, will be expensive but still worthwhile to deploy widely in high-value applications. But also like fabs, these models will depreciate in value quickly. As ChatGPT's competitors -- and especially its open-source ones -- catch up, its value as an asset to OpenAI will rapidly decline. But provided that researchers can continue to push the capabilities of cutting-edge models, and especially if a single company can remain on top, the future for the companies building them will remain bright.
A few pieces of the ecosystem are still missing before we reach this equilibrium. The first is that FM inference is quite a pain, and people use OpenAI's APIs not only because their models are excellent, but also because their APIs are simple and reliable and their documentation is clear. We need API providers for commodity models, analogous to OpenAI. Several companies are well positioned to take this role if they want (Cohere, Google, Huggingface, potentially cloud vendors). None seem to have actually taken up the mantle, in part because being a commodity provider is not an inherently fun proposition. I think that view is too harsh, though. Reliable, cheap service and high-quality APIs and documentation are easier said than done. Such a position in the market also leaves one well positioned to build one's own state-of-the-art models.
In addition to providing APIs to commodity models, such a company could also provide hosting for existing AI companies. Right now, even though AI companies can access cloud compute without too much start-up capital, it still takes significant effort to deploy models cost-effectively. Allowing a company to upload a PyTorch binary and receive a pay-per-use endpoint to that model could be very valuable indeed. The reason this makes sense is that much of the engineering effort to run AI more quickly is common across different models, allowing greater investment and optimization.
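As a rough illustration of what that product could feel like, here is a hypothetical client-side sketch. The host, endpoints, and response fields are all invented for the example; no provider actually exposes this API today.

```python
# Hypothetical sketch of a "upload a model, get a pay-per-use endpoint" workflow.
# None of these URLs or field names are real; they only illustrate the idea above.
import requests

HOST = "https://api.example-inference-host.com/v1"  # hypothetical provider
API_KEY = "sk-..."                                   # placeholder credential
headers = {"Authorization": f"Bearer {API_KEY}"}

# 1. Upload a serialized PyTorch model (e.g. a TorchScript archive).
with open("my_model.pt", "rb") as f:
    resp = requests.post(f"{HOST}/models", headers=headers, files={"model": f})
model_id = resp.json()["model_id"]

# 2. The provider hands back a pay-per-use endpoint for that model.
endpoint = f"{HOST}/models/{model_id}/predict"

# 3. Callers then pay per request (or per token) instead of per GPU-hour.
out = requests.post(endpoint, headers=headers, json={"inputs": [1.0, 2.0, 3.0]})
print(out.json())
```

The point is that everything behind that endpoint (batching, autoscaling, kernel-level optimization) is shared engineering across customers, which is exactly where the cost advantage comes from.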
Another thing missing is good benchmarks for different models. Having decent metrics for the capabilities of different models (potentially even hundreds per model), combined with a unified API for accessing any and all of them, could dramatically reshape the competitive landscape of AI.
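Here is a minimal sketch of how such capability metrics could plug into a unified API: route each task to the cheapest model that clears a quality bar. The model names, scores, and prices below are invented purely for illustration.

```python
# Hypothetical capability scorecard: per-model scores in [0, 1] plus a price
# per 1K tokens in USD. All values here are made up for the example.
CAPABILITY_SCORES = {
    "frontier-model":  {"coding": 0.92, "summarization": 0.95, "price": 0.06},
    "commodity-model": {"coding": 0.74, "summarization": 0.90, "price": 0.002},
    "tiny-model":      {"coding": 0.40, "summarization": 0.78, "price": 0.0004},
}

def cheapest_sufficient_model(capability: str, min_score: float) -> str:
    """Return the cheapest model that is 'good enough' for this capability."""
    viable = [
        (info["price"], name)
        for name, info in CAPABILITY_SCORES.items()
        if info[capability] >= min_score
    ]
    if not viable:
        raise ValueError(f"No model clears {min_score} on {capability}")
    return min(viable)[1]

# Summarizing low-value documents doesn't need the frontier model.
print(cheapest_sufficient_model("summarization", 0.85))  # -> "commodity-model"
```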
August 5, 2023
TL;DR: Can somebody please actually compete with OpenAI?
Right now, my company Cofactory primarily uses OpenAI's APIs. I desperately want someone to make it possible for me to move away from ChatGPT. Nobody has done a good enough job yet. As far as I can tell, it isn't because it's impossible; the inference providers either don't know what is needed, don't know how to build it, or just haven't finished yet.
First, a disclaimer. This addendum is meant to tell you how to conquer the markets of startups, independent developers, researchers, small businesses, and private equity. It does not apply to serving custom models for large enterprises. So, if the latter is your business (@baseten) then feel free to ignore this. I am also (as you will see) highly opinionated. I am at least self-aware of this fact.
But if you are an inference provider which cares about literally any other market then this addendum is a letter to you. Please build me what I want and I will gladly switch to your services straightaway. Because I'm nice, I'll even tell you how to build it! For free!
What I actually want
Here's what I actually want from inference providers:
1. Models that work well out of the box with prompting alone, so I don't have to fine-tune anything.
2. A single, obvious default model that is genuinely good.
3. Dramatically cheaper inference (think 10x), even at the cost of some latency.
How to build it
Point 1 is self-explanatory. The main thing to make this work is: don't focus on letting people fine-tune their own models. Instead, get better at doing everything efficiently with prompting.
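As a small sketch of what "everything via prompting" looks like in practice, here is a classification task handled with a few-shot prompt rather than a fine-tune. The `complete` callable is a stand-in for whatever text-completion function the provider exposes, not a real client library.

```python
# Few-shot prompting instead of fine-tuning: a minimal sketch.
FEW_SHOT_PROMPT = """You label customer messages as BUG, BILLING, or OTHER.

Message: "The app crashes when I upload a photo."
Label: BUG

Message: "Why was I charged twice this month?"
Label: BILLING

Message: "{message}"
Label:"""

def classify(message: str, complete) -> str:
    """`complete` is any text-completion callable the inference provider offers."""
    return complete(FEW_SHOT_PROMPT.format(message=message), max_tokens=3).strip()

# Usage (with some provider's completion function):
# classify("Your invoice page 404s for me", complete=my_provider_completion)
```

A provider that makes this path fast and cheap removes most of the reason customers think they need fine-tuning in the first place.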
Point 2 is not particularly self-explanatory. Fortunately, it is symbiotic with point 3, since the lower the entropy of the distribution of model requests, the easier it is to optimize the systems. There are two parts to making a good default. One is easy and the other is hard.
The easy part is to simply fix the marketing. Highlight the default! If you want to provide other models, fine. But show me which one I should start with.
The hard part is to actually RLHF the open-source models effectively. My experience so far with the Llama-2-chat models, for example, is that their RLHF does not have quite the breadth of OpenAI's models, and as a result they are not as good at the long tail of tasks as the closed-source ecosystem. (I have seen them fail at shockingly simple things.) I don't particularly care who actually provides the data or the training or whatever, and I don't mean that inference providers need to become experts in training models. But the barrier to turnkey operation right now is not the base models; it is the alignment. Get me Llama-2-chat-improved and I'm happy.
Point 3 is the most complex, but also a place where there is ample room for differentiation. For most of my applications, I actually don't care about another 100ms of latency. I do care about a 10x reduction in cost. Here are the ways I see to drive that.
First, using points (1) and (2), assemble a critical mass of demand onto a single model. You can't do the necessary engineering without scale. To the extent that OpenAI is better at LLM systems right now, I think it is mostly because they operate at much larger scale. That OpenAI has a head start is forgivable; that commodity model providers are actively fragmenting their demand, and thus destroying most of their potential cost advantage, is not.
Second, batching! I am astonished that most of the model providers currently can't do this. As a result, I am paying for memory bandwidth instead of compute, because I am running at an arithmetic intensity of 2. There is a free factor of 100 in cost savings for pulling this off! Yes, it requires more complex infrastructure to route requests to the right GPU. But these are normal systems problems for which there is strong precedent.
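To make that factor concrete, here is the back-of-the-envelope behind the claim, using ballpark H100-class numbers that I am assuming rather than quoting: at batch size 1, a 70B-parameter int8 model streams ~70GB of weights to do ~140 GFLOPs of work per token, so decoding stays memory-bound until the batch size approaches the hardware's compute-to-bandwidth ratio.

```python
# Why batching matters: rough roofline arithmetic (ballpark numbers, not specs).
params = 70e9                 # Llama-2-70b parameters
weight_bytes = params * 1     # int8 weights: ~70 GB
flops_per_token = 2 * params  # ~2 ops per parameter per generated token

mem_bw = 3.0e12               # ~3 TB/s HBM bandwidth per GPU (approx.)
compute = 2.0e15              # ~2e15 int8 ops/s per GPU (approx. peak)

# At batch size 1, every decoded token re-reads all the weights.
t_mem = weight_bytes / mem_bw           # time to stream the weights once
t_compute = flops_per_token / compute   # time to do the math for one token
print(t_mem / t_compute)                # ~330x with these numbers; the factor of 100 above is conservative

# Batching B requests shares one weight read across B tokens, so until
# B * t_compute approaches t_mem, the extra tokens are essentially free.
# (KV-cache traffic and activations are ignored in this sketch.)
for batch in (1, 8, 64, 256):
    tokens_per_s = batch / max(t_mem, batch * t_compute)
    print(batch, f"{tokens_per_s:,.0f} tokens/s")
```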
Now let's discuss some workload-specific stuff that can also really help. Here are the two parts of my workloads that suck on OpenAI right now:
1. Every request re-sends (and re-pays for) the same long instruction prompt, even though it never changes.
2. Multi-turn conversations re-process (and re-bill) the entire conversation history on every new message.
(Side note: OpenAI, if you're reading this, and you want to fend off the open-source folks -- be my guest!)
Here's how to fix these issues.
For the first one, let me create a custom prompt and cache those instructions. (Happy to pay for this!) Suppose my initial prompt was 1,000 tokens. Then on Llama-2-70b the KV cache (with GQA) takes about 330MB in fp16, or roughly half that in int8. I recognize this sounds like a lot (and in some ways, it is!) But for the price of a single H100, you can get a few hundred TB of Gen5 NVMe SSD with 12GB/s of bandwidth on each drive. This corresponds to hundreds of millions of cached tokens on a single machine. And with appropriate RAID, there should be plenty of transfer speed to get a whole batch's worth into GPU HBM pretty darn fast. Whether or not modern LLM systems teams have the stomach for these sorts of optimizations, I don't know. But I see no fundamental reason why it can't be done. And the benefit is a much better use of scarce and expensive GPU resources, along with potentially lower latencies due to reuse of these KV caches. Paged attention is likely a precursor to managing this complexity. But if I can go from paying for 1,050 tokens to paying for 50 tokens per query, that's a very compelling value proposition!
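Here is a sketch of the bookkeeping this implies on the serving side. The `prefill` and `decode_with_cache` callables are hypothetical stand-ins for the real attention and paged-KV machinery, and pickling tensors to NVMe is the crude version of what a production system would do with a proper block store.

```python
# Sketch of prompt-prefix KV caching on cheap NVMe (serving internals hand-waved).
import hashlib
import os
import pickle

CACHE_DIR = "/mnt/nvme/kv_cache"  # large, cheap Gen5 NVMe pool
os.makedirs(CACHE_DIR, exist_ok=True)

def _key(prompt_prefix: str) -> str:
    return hashlib.sha256(prompt_prefix.encode()).hexdigest()

def get_or_build_prefix_kv(prompt_prefix: str, prefill):
    """Load the prefix's KV cache from SSD, or prefill once and persist it.

    Billing-wise, a cache hit means the caller only pays for new tokens,
    not for re-processing the 1,000-token instruction block on every request.
    """
    path = os.path.join(CACHE_DIR, _key(prompt_prefix) + ".kv")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)      # a few hundred MB: quick on striped NVMe
    kv = prefill(prompt_prefix)        # one-time cost per unique prefix
    with open(path, "wb") as f:
        pickle.dump(kv, f)
    return kv

def answer(prompt_prefix: str, query: str, prefill, decode_with_cache) -> str:
    kv = get_or_build_prefix_kv(prompt_prefix, prefill)
    return decode_with_cache(kv, query)  # caller pays for the query tokens only
```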
For the second one, let me pay to keep the KV cache of that conversation in memory. Specifically, I'd like to set a timeout (say, 15 seconds) for sending a new query, and as long as I submit by then, I only pay for new tokens. My ballpark figure is that for a DGX H100, accounting for limited utilization, one should get around 4,000 TOPS of int8 compute, which corresponds to ~25,000 tokens/sec on Llama-2-70b. This corresponds to about 8 GB/s of KV cache being generated, on a server with at least 640GB of HBM. In short: you can actually afford to keep the KV cache around for at least a few seconds, and if you want to cache for longer you can spill into RAM or SSDs, too. And there are likely further tricks that can be played as well. (Storing the KV cache in int4, for example, buys another factor of 2.) My back-of-the-envelope says the math actually checks out here: keeping the KV cache around for a few seconds is genuinely worthwhile.
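Writing that estimate out explicitly (the hardware figures are the same assumptions as above, not measurements):

```python
# Sanity-checking the conversation-KV estimate with explicit numbers.
int8_ops = 4_000e12            # assumed achievable int8 ops/s on a DGX H100
ops_per_token = 2 * 70e9       # ~2 ops per parameter per generated token
tokens_per_s = int8_ops / ops_per_token
print(f"{tokens_per_s:,.0f} tokens/s")            # ~29,000; rounded to ~25,000 above

# Llama-2-70b with GQA: 80 layers * 8 KV heads * 128 head-dim * 2 (K and V) * 2 bytes (fp16)
kv_bytes_per_token = 80 * 8 * 128 * 2 * 2         # ~320 KB per token
kv_growth = tokens_per_s * kv_bytes_per_token
print(f"{kv_growth / 1e9:.1f} GB/s of new KV cache")  # ~9 GB/s, in line with ~8 GB/s above

hbm_bytes = 640e9              # 8 x 80GB HBM on a DGX H100
weight_bytes = 140e9           # fp16 weights; int8 would free another 70GB
spare_hbm = hbm_bytes - weight_bytes
print(f"{spare_hbm / kv_growth:.0f} s of KV-cache growth fits in spare HBM")  # ~50 s

# Retaining each conversation's cache for a 15-second timeout implies roughly
# 9 GB/s * 15 s ~= 140 GB resident, comfortably under the spare HBM.
```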
There are lots of other tricks that one can play, too. One could set up separate expensive-but-low-latency inference endpoints that apply extremely lossy optimizations to produce a draft model on the fly, and then use it to speculatively decode the main model. One could allow the user to assemble batches of queries to be run all at once, especially if they all share a common prefix. The list goes on. But for now, I'll end it here.