Andrew Marble
marble.onl
andrew@willows.ai
Dec 30, 2023
As a preamble, current investment in and expectations of AI have well overshot any real applications. In the last year many of us (me particularly) have been distracted and wowed by amazing demos of things we’d never expected to be possible in language and image generation. But there are still few if any proven commercial applications. It’s easy to find lists of generative AI applications, but I’d argue these are still firmly in the ideas phase (or hype-cycle peak of inflated expectations) and not repeatable, commercially viable, plateau-of-productivity applications. I expect 2024 to still be about hype, but with more real commercial experiments that actually look for value and ROI instead of just cool demos.
In those experiments, we’ll see more really bad applications, particularly anything related to customer service, and go through a phase of extreme frustration with companies that try to use AI as a way of denying proper service. We’ll also see a lot of “smoke and mirrors” use cases that either don’t really depend on AI and could equally be done with a regular computer program, or that offer functionality like LinkedIn’s “draft a message with AI”, which lets you create verbose, empty content for things like cover letters or marketing copy. Amidst all the dross, real, valuable applications will emerge, particularly around bulk processing of text and unstructured data and other automation tasks. The hype is going to start becoming unsupportable, and investment is going to flow towards more practical applications and the tools that can provably support them.
My predictions below are based on the idea of AI shifting to the more practical, and a change in how we’re going to look at accountability and compute infrastructure, as well as the continued development of the underlying technology.
Shift in “Accountable AI” from theoretical to practical
When self-driving cars were first being discussed, we saw many articles along the lines of the “trolley problem”: if a self-driving car heading for an unavoidable accident has to decide between killing one person or another, how does it decide1? It’s worth watching Sebastian Thrun’s response to the question2. Suffice to say, the question is philosophical and of no practical relevance, and it misunderstands how self-driving cars work and how they might fail. But it was the kind of pop-ethics question that was trendy to talk about when a new technology was introduced. Years later, we still don’t have self-driving cars outside of narrow circumstances, and the discussions are much more practical, about how to make them work rather than about philosophy problems. The same shift from theoretical to practical is going to happen with generative AI.
There has been an increased focus on what I call accountable AI but what people usually call “AI safety”, “ethics”, etc. (names I don’t like because they reflect the gap between the hypothetical and the real that I discuss below). A lot of energy has gone into talking about bias (often aimed at protected characteristics like race and sex), malicious use including fraud and misinformation, and, in some circles, more esoteric dangers about societal upheaval or AI turning against us, often called “existential risk”.
The existential concerns sort of parallel the trolley problem. They aren’t really about the technology; they are philosophical thought experiments that begin from a starting point that’s impossible to reach. In practice they are absurd, out of line with any reality of current or foreseeable AI systems, and rightly being chased out of the public discourse by common sense. It’s also more or less widely acknowledged that much of the fear around AI misuse or danger was essentially an attempt at regulatory capture, with incumbents putting up barriers to prevent competition, and tolerance for this behavior is decreasing.
Bias and misuse are more realistic problems, but so far they have been treated, at least in popular discussion, fairly superficially and theoretically. It feels like there have been lots of attempts, and successes, to get foundation AI models (language or image) to generate offensive, incorrect, or otherwise undesirable output, or to produce output that is biased with respect to some personal characteristic. For example, I successfully showed that OpenAI’s Dall-e is biased towards generating images of mallard ducks when asked to draw a duck, to the exclusion of other ducks3. While this work has some interesting theoretical value, it’s mostly unrelated to the way models are trained and used, and the findings do not reflect how a system that uses a model will behave in most circumstances. Most practical AI systems, especially commercial ones, do not depend on the raw biases of the underlying foundation model, and dwelling on or changing these is a distraction. Likewise, most practical systems are not concerned with whether or not a model lies, cheats, or generates election misinformation. These are really just characteristics of the training data (the internet) that we’ve found a new way to probe.
What I expect to change is a focus on much more practical aspects. What guarantees or assurances do we have that output is correct? Have we masked personally identifying information, competitors’ brands, etc.? Is our output structured appropriately for its use as part of a system? Software like Outlines4, which generates output following a specified format, or Guardrails5, which imposes user-specified checks and balances on model output, will become more relevant. We’ll see more prompting that allows for correction and reflection, with the raw LLM output never exposed directly without oversight.
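As a minimal sketch of that pattern (not the actual Outlines or Guardrails API), the following assumes a hypothetical call_llm function and shows output being parsed, checked, and masked before anything downstream sees it, with a corrective re-prompt when validation fails:

```python
import json
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever model API or local model is in use;
    # returns a canned reply here so the sketch runs end to end.
    return '{"summary": "Contact jane@example.com for details."}'

# Simple stand-in for a fuller PII filter: mask email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def generate_structured(prompt: str, max_retries: int = 3) -> dict:
    """Ask the model for JSON, validate it, and mask obvious PII before
    exposing anything downstream. Re-prompts with a correction if the
    output fails the checks."""
    request = prompt + "\nRespond with a single JSON object."
    for _ in range(max_retries):
        raw = call_llm(request)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Reflection step: tell the model what was wrong and try again.
            request = prompt + "\nYour last reply was not valid JSON. Respond with a single JSON object only."
            continue
        # Never return the raw text: mask string fields before returning.
        return {k: EMAIL_PATTERN.sub("[REDACTED]", v) if isinstance(v, str) else v
                for k, v in parsed.items()}
    raise ValueError("model did not produce valid structured output")

print(generate_structured("Summarize this support ticket: ..."))
# {'summary': 'Contact [REDACTED] for details.'}
```

The point is not the specific checks but the shape of the system: the model’s output is an intermediate artifact that gets validated, corrected, and filtered before it is used.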
I expect the philosophical work to continue, but to become less relevant and further distanced from practice as more applications emerge and people focus their efforts on addressing real challenges instead of hypothetical ones.
Simple, faster AI compute infrastructure
The computational requirements for AI are demanding. On one hand, there has been a need for high flexibility in software frameworks because the field is changing rapidly. On the other, this flexibility adds complexity and to some extent prevents the optimization that squeezes the most performance out of hardware. Training and inference code has also typically been bundled together, even though inference (running the AI system day-to-day) is much simpler.
With respect to inference, what has emerged, and what I expect to continue gaining momentum, are simplified inference frameworks, both software and hardware, that focus on maximal simplicity and speed for specific applications. On the hardware side, something similar happened with Bitcoin as it gained popularity: the hardware shifted from GPUs to application-specific integrated circuits (ASICs). We’re seeing companies like etched.ai6 and the mysterious positron.ai7 that have proposed such chips, and I expect this niche to grow. In addition to ASICs, we are seeing more and more AI-specific chip companies. Groq8 (nothing to do with Twitter’s AI foray) has a chip that allows flexible implementation of LLM inference. Untether AI9, Cerebras10, Tenstorrent11, and many others all have high-speed chip offerings. 2024 is going to be when the successful ones start to see wide-scale adoption to handle commercial applications.
Not everybody needs the high speed these chips offer, and “local” software that uses either CPUs or commodity GPUs will also continue to gain popularity. Among other reasons, running local models is compelling because it doesn’t require sharing any sensitive data with third parties. This year, llama.cpp12 and Ollama13 rose to popularity as easy-to-use frameworks for running LLMs on a home computer. 2023 was the year of llama.cpp because of how well it supports experimentation, working out of the box with diverse models and hardware. It’s going to stay relevant, but we’ll see more focus on simplicity and dedicated applications as fixed end uses emerge. PyTorch has also acknowledged the need for fast inference14 and is building an edge framework15. Here I’ll shamelessly plug my own llm.f90 inference framework, which focuses on minimal, fast CPU (for now) inference in a very simple open-source script that can be incorporated into other software16.
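To give a flavor of how simple local inference has become, here is a minimal sketch using the llama-cpp-python bindings to llama.cpp; the model path is a placeholder for whatever GGUF file you happen to have on disk:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder local file
    n_ctx=2048,                                             # context window size
)

# Plain completion on the CPU; no data leaves the machine.
result = llm("Summarize the following contract clause: ...", max_tokens=128)
print(result["choices"][0]["text"])
```

Everything runs on local hardware, which is exactly why this style of deployment is attractive for sensitive data.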
That’s a good segue into the other part of software, which is programming languages. Python has been and will be the default AI research programming language for the foreseeable future. For optimized single-purpose models, we’re seeing lower-level languages become more popular, particularly C/C++, but I’m bullish on the rise of Fortran, which was and is made for writing fast numerical code. As a higher-level language that’s closer to Python for the kind of data manipulation at the core of AI, but as fast as C, it’s well suited to writing AI code. A challenge is the almost complete absence of mindshare and use by newer programmers, which I’m trying to change.
Fundamental Advances in Language Model Architectures
All of the popular language models are based on attention, which “is all you need” according to the famous paper17. Attention looks back at the words (tokens) the model has seen or generated so far and decides which ones are relevant to generating the next word. The problem is that it scales poorly: because it looks at how every token relates to every other one, the computation involved grows with the square of the number of tokens. For applications like RAG (using an LLM with a custom document set), this limits the practical amount of text a model can look at.
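To make the scaling argument concrete, here is a toy single-head scaled dot-product attention in NumPy (illustrative only, not any particular model’s implementation); the n × n score matrix is where the quadratic cost in sequence length comes from:

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (n, n): every token vs. every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # (n, d)

n, d = 1024, 64                                       # sequence length, head dimension
Q = K = V = np.random.randn(n, d)
out = attention(Q, K, V)                              # the n x n score matrix is the bottleneck
```

Doubling the context length quadruples the size of that score matrix, which is the practical limit the alternative architectures below are trying to remove.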
Attention aside, there are also questions about the overall parameter efficiency of models. Current state-of-the-art LLMs are very big, and we’d all be happier if they were smaller.
For 2024, I expect at least one more family of competitive models to emerge and get real “attention” in commercial use while not using attention. An existing example is RWKV18, which uses an RNN-based block (recurrent neural networks are an older kind of architecture that was previously popular for language processing; RWKV reimagines them for use within an LLM). Another example is Mamba19, which uses an (also recurrent) state-space model inside a recurring block to build an LLM. RetNet20 uses yet another kind of recurrent network. All of these architectures support longer text without incurring a quadratic penalty like transformers. None have “broken through” to replace or compete with transformer LLMs, but expect this to change, with either these or other new architectures becoming competitive.
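To illustrate the common thread, here is a schematic linear recurrence in NumPy (not the actual RWKV, Mamba, or RetNet math, just the shape of the idea): a fixed-size state is updated once per token, so cost grows linearly with sequence length rather than quadratically:

```python
import numpy as np

def recurrent_scan(x, A, B, C):
    """x: (n, d_in) token features; returns (n, d_out) outputs."""
    d_state = A.shape[0]
    state = np.zeros(d_state)
    outputs = []
    for x_t in x:                      # one pass over the sequence, O(n)
        state = A @ state + B @ x_t    # fold the new token into a fixed-size state
        outputs.append(C @ state)      # readout for this position
    return np.stack(outputs)

n, d_in, d_state, d_out = 1024, 64, 128, 64
x = np.random.randn(n, d_in)
A = np.random.randn(d_state, d_state) * 0.01   # kept small for stability in this toy example
B = np.random.randn(d_state, d_in)
C = np.random.randn(d_out, d_state)
y = recurrent_scan(x, A, B, C)
```

The real architectures add gating, normalization, and parallel-scan tricks on top, but the appeal is the same: per-token cost and memory that do not depend on how much text came before.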
…
I’ve tried to be optimistic here and mention what I see as positive developments. As long as AI is attracting a lot of attention, there is going to be controversy, attention seeking, and grift attached to the field as well, and this will continue. Ultimately, the negatives will not affect the course of the technology in the longer term, but we do need to be vigilant about some attempts to shift the balance of power towards special interests. 2023 saw regulatory capture under the guise of safety as a big theme. In 2024, I expect to see more emphasis on claiming copyright entitlements21 and more efforts to redefine open-source in favor of big businesses22, efforts that need to be called out and opposed.
https://www.gsb.stanford.edu/insights/exploring-ethics-behind-self-driving-cars
https://www.youtube.com/watch?v=nuDIfITdhSc
https://www.marble.onl/posts/code_of_practice_and_bias.html
https://github.com/outlines-dev/outlines
https://github.com/guardrails-ai/guardrails
https://www.marble.onl/posts/general_technology_doesnt_violate_copyright.html
https://www.marble.onl/posts/software-licenses-masquerading-as-open-source.html