Why host your own LLM?

Andrew Marble
marble.onl
andrew@willows.ai
August 13, 2023

In the Terminator movies, good relationships beat technological superiority. Kyle Reese and Sarah Connor outwit the advanced T-800 who in turn helps Sarah and John beat the ultra-advanced T-1000. OpenAI’s GPT-4 is currently the most advanced publicly available language model. There are also analyses showing it’s generally cheaper to run than self-hosting comparable models. I want to argue that despite everything OpenAI’s models have going for them, it’s worth considering self-hosting anyway, especially if you’re building a product or an internal capability.

If you’re using language models for custom applications, you can use an API from companies like OpenAI or Anthropic, where you submit your prompt, get a response, and pay usage-based fees. Or you can configure your own model and host it locally or in the cloud. There are many models available for self-hosting. A few recent analyses have made a case for using OpenAI’s API based on cost and performance [1, 2]. Very detailed cost calculations are possible, but the obvious cost advantage of a usage-based API is that you only pay for the hardware when you use it. Most self-hosted applications will struggle to get good utilization from dedicated GPUs and so end up paying a lot for idle time.
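To make the two options concrete, here is a minimal sketch of each integration style in Python. It assumes the pre-1.0 openai client that was current when this was written, and the model names are illustrative; a real deployment would add batching, retries, and proper serving infrastructure.

```python
# Minimal sketch: the same prompt answered two ways.
import openai
from transformers import pipeline

prompt = "Summarize: the meeting moved to Thursday at 3pm."

# Option 1: hosted API. You pay usage-based fees and manage no hardware.
openai.api_key = "sk-..."  # your API key
api_reply = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)["choices"][0]["message"]["content"]

# Option 2: self-hosted model. You pay for the GPU whether or not it's busy.
# The model name is illustrative; any local causal LM works the same way.
local_llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
local_reply = local_llm(prompt, max_new_tokens=100)[0]["generated_text"]
```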

There’s a lot of subtlety in gauging performance – personally I think rankings on the various benchmarks and “leaderboards” [3] don’t map one-to-one onto performance on specific commercially relevant tasks. But GPT-4 is unequivocally better than the rest across a wide range of skills, and only the best publicly available models compete with Claude (Anthropic’s model) and GPT-3.5.

Despite these advantages, there is still a compelling case for working with publicly available models. (Note I don’t say open source; many have license limitations that disqualify them from being labeled as such [4], but I won’t dwell on that here.) To me it boils down to the “relationship”. Using APIs means you’re a taker of whatever OpenAI et al. are offering. Model features, customizations, values (censorship and world view), etc. are all dictated by those companies. You can only build a front-end. This also means you don’t have access to internal states and so are limited, for example, in applying advanced accountability techniques or guardrails on top. All of this could be good – it means you don’t have to worry about it. But it also makes whatever you build utterly dependent on a start-up company.

For “relationship-based” development, there are good reasons to use self-hosted models. Having control over the model architecture and weights removes uncertainty about future changes and means you don’t have to take what OpenAI decides to give you. There is a rich ecosystem of different models to experiment with, as well as the ability to customize – for example, by fine-tuning on your own terms. This control ultimately lets you build a long-term relationship with your AI model and adapt your product around it: you have clarity that what you build will keep working with the model you’ve chosen, and you control when and if to make changes. It lets you build something that isn’t just a front-end on somebody else’s language model but is deeply integrated.
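As one concrete example of customizing on your own terms, here is a hedged sketch of attaching LoRA adapters to a self-hosted model with the peft library. The base model and target modules are stand-ins (“gpt2” and its fused attention layer), and the training loop itself is omitted.

```python
# Sketch: fine-tuning on your own terms with LoRA adapters (peft library).
# "gpt2" is a stand-in for whatever base model you actually host.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # adapters are a tiny fraction of the base weights

# ...train with your own data and Trainer, then keep the adapter you own:
# model.save_pretrained("my-adapter")
```

The adapter weights stay under your control: nobody can deprecate, retrain, or re-align the model out from under the product you built on it.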

Also, for many applications, the well-rounded superiority of a GPT-like model is not what’s driving value. Running a model as big as GPT-4 can cost tens of thousands of dollars per month. But it’s possible to run 7B and 13B models (models with 7 and 13 billion parameters, common sizes for LLaMA and other public models) on a laptop. These models are big enough to perform many tasks competently and can be cost-effective as part of local systems.
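For a sense of what “on a laptop” looks like in practice, here is a sketch using llama-cpp-python to run a 4-bit-quantized 7B model on CPU. The model file name is a placeholder for any quantized LLaMA-family checkpoint; at 4-bit, a 7B model needs roughly 4 GB of RAM.

```python
# Sketch: CPU inference with a quantized 7B model via llama-cpp-python.
# The model path is a placeholder for a 4-bit checkpoint you've downloaded.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

out = llm(
    "Q: Classify the sentiment of 'the service was slow but the food was great.'\nA:",
    max_tokens=32,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```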

“Responsible” use of AI has many meanings. Tech companies have often focused on political correctness and superficial notions of bias, largely to avoid controversy in broadly capable public models like ChatGPT. For many applications, particularly specialized knowledge work [5], those concerns are mostly irrelevant and give way to real issues about factual accuracy, completeness, or simply staying on-topic. Many techniques for “keeping models in line” require access to internal states, gradients, and intermediate outputs [6]. Using an API-based model limits the kind of experimentation and augmentation that is possible.
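To illustrate what internal access buys you, here is a sketch in the spirit of the probe work cited in footnote 6 (not a reproduction of it): a linear classifier trained on a self-hosted model’s hidden states to flag statements the model “represents” as false. The model name and toy dataset are stand-ins; none of this is possible through a completion API that returns only text.

```python
# Sketch: a linear truthfulness probe on hidden states, which requires
# self-hosting; APIs do not expose these activations. "gpt2" is a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def last_hidden(text):
    """Final-layer hidden state of the last token."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

# Toy labeled data: 1 = true statement, 0 = false statement.
statements = [
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("The moon is made of cheese.", 0),
    ("Paris is the capital of France.", 1),
    ("Two plus two equals five.", 0),
]
X = torch.stack([last_hidden(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the "guardrail" classifier
print(probe.predict(X))
```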

The same holds true for various optimizations, such as caching internal model states, as well as model fine-tuning. APIs offer some of these options, but they are limited compared to what full model access allows. The technology is still evolving so quickly that new models and techniques become available every day. For those using an LLM as a tightly integrated part of a product or tool, the only way to have the flexibility to evolve with the technology is to have a self-hosted model.
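As a concrete instance of the caching point: with a self-hosted model you can compute the key/value cache for a fixed prompt prefix once and reuse it across requests, which hosted APIs don’t let you do directly. A minimal sketch, with “gpt2” again standing in for your own model and greedy decoding for brevity:

```python
# Sketch: reuse the KV cache of a fixed prefix across many requests.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix_ids = tok("You are a contract-review assistant.\n", return_tensors="pt").input_ids
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values  # computed once

def answer(question, max_new_tokens=20):
    past = copy.deepcopy(prefix_cache)  # newer transformers mutate caches in place
    next_ids = tok(question, return_tensors="pt").input_ids
    out_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(next_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_ids = out.logits[:, -1:].argmax(dim=-1)  # greedy decoding
            out_tokens.append(next_ids)
    return tok.decode(torch.cat(out_tokens, dim=-1)[0])

print(answer("Summarize the indemnity clause in one sentence."))
```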

An additional aspect of the fast pace of change in language models right now is that the skills and knowledge required to work with the technology are evolving quickly. Working with self-hosted models builds institutional and individual experience in this evolving landscape in a way that APIs don’t. For professional development of employees, as well as adaptability to change, engaging with “AI” at a deeper technical level is important for many companies, particularly those that are building applications. It’s not a mature technology, and part of the “moat” that we practitioners have is simply knowing what’s going on. I’ll actually go further and say that any organization making nontrivial use of AI should, internally or through advisers, have access to some deep knowledge of the technology, not just the API reference, to be able to understand what it fundamentally does best. As AI gets commodified and hyped up, there often ends up being a big disconnect between what it can do and what it’s used or proposed for.

In a few years, I expect the landscape will look very different – there will be agreed-upon things that are critical to be able to do with a model, and APIs will support them. For a new, still experimental, and rapidly evolving technology, real participation requires deep access to the models and code. This doesn’t mean that all companies or products require such access – there are many valuable things that can be built on top of an API that would probably be a waste of time to self-host. But these are different kinds of products.

Back to the Terminator: Reese and the T-800 both built strong relationships that led to the successful completion of their missions. The Skynet-tasked Terminators just went around flexing their superior technological prowess, and it wasn’t enough to win the day. Part of building those relationships is access. I know it’s a silly analogy, but I believe the same is true with these models: it’s about being able to deeply understand the strengths of the tool and build something tightly integrated, and you can’t do that with an API.


  1. https://betterprogramming.pub/you-dont-need-hosted-llms-do-you-1160b2520526

  2. https://www.cursor.so/blog/llama-inference

  3. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

  4. http://marble.onl/posts/software-licenses-masquerading-as-open-source.html

  5. https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/

  6. There is a wide range of literature and research into model accountability. For a narrow example, see “The Internal State of an LLM Knows When It’s Lying” (https://arxiv.org/pdf/2304.13734.pdf), but also https://arxiv.org/pdf/2307.00175.pdf