Andrew Marble
marble.onl
andrew@willows.ai
February 9, 2026
AI is pretty notorious for making great 80% demos that never scale to something usable. For the record, I remain an AI bull: obviously it’s not a fad, and it has real utility that will be world-changing. But it’s also very over-hyped. The cool demo problem is a Ponzi scheme in a sense. Something gets announced, it looks awesome, and then when people start digging in they see all the edge cases where it doesn’t work, and what it would take to get to something productive, and they start getting disillusioned. But then there’s an even cooler demo that makes us forget the shortcomings we noticed and get dazzled again. As long as newer, cooler things keep arriving faster than people can critically assess them, the scheme keeps working, but we’re paying back past promises with new inventions, and when those stop there will be trouble.
The really unfortunate part is that until they stop, we’re not going to collectively focus on doing something useful with AI; instead, we’ll keep chasing cool demos.
Case in point: recently there has been hype around AI coding agents building big projects autonomously:
Cursor wrote a “from scratch” web browser with AI, using parallel coding agents to generate over a million lines of code in a week: https://cursor.com/blog/scaling-agents
Anthropic used Claude Code agents running in parallel to write a C compiler that can compile the Linux kernel: https://www.anthropic.com/engineering/building-c-compiler. This apparently cost $20,000 in token use.
Keeping with the analogy, these “cool demos” glossed over existing problems (of which there are many) in AI-assisted coding, and showed how LLMs can be orchestrated to write million-LOC projects autonomously. Of course they have some rough edges, are slower and buggier than real software, etc., but the principle is there, right?
Browsers and compilers have a lot in common from a vibe coding perspective in that both have detailed existing behavioral specifications. This makes thorough testing comparatively easy because someone has told us exactly how the software should behave. It also allows one to declare victory (“it mostly adheres to the spec”) while glossing over factors that are more important in most real software projects, like taste.
Having seen the above projects, I wanted to try something of my own and, believe it or not, had pretty high expectations, having been wowed by these demos. The project I chose was a Google Docs competitor (I can’t stand Google Docs, but that’s not important here). The project repo is at https://github.com/rbitr/altdocs and a running version is hosted (for now) at https://myaltdocs.com/.
The repo includes the instructions and script to run the coding agent (here Claude Code / Opus 4.6), so I won’t go into much detail. Briefly, I followed the Anthropic compiler example linked above, except I used only a single agent. I wrote a short document describing the meta-project of writing a coding agent that runs in a loop to build a Google Docs clone, and got Claude Code to set up a design document (spec/FEATURES.md), a prompt (AGENT_PROMPT.md), and a shell script (run_agent.sh) to run the coding agent in a loop. I kicked this off on a virtual machine and ran it for about 8 hours, with a few minor interventions.
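For readers who just want the shape of the setup without opening the repo, here is a minimal sketch of what such a loop can look like. It is illustrative only: the actual run_agent.sh differs, and the CLI invocation shown (the -p print mode and --dangerously-skip-permissions flag) reflects my understanding of the Claude Code CLI rather than a description of the script in the repo.

```bash
#!/usr/bin/env bash
# Illustrative sketch, not the actual run_agent.sh from the repo.
# Assumes the Claude Code CLI ("claude") is installed and an API key is configured.
# AGENT_PROMPT.md tells the agent to read spec/FEATURES.md, pick the next
# unfinished feature, implement and test it, and commit before exiting.
set -euo pipefail

PROMPT_FILE="AGENT_PROMPT.md"
MAX_ITERATIONS=100   # a hard stop; in practice I watched it and stopped it by hand

for i in $(seq 1 "$MAX_ITERATIONS"); do
  echo "=== Agent iteration $i ==="
  claude -p "$(cat "$PROMPT_FILE")" \
    --dangerously-skip-permissions \
    2>&1 | tee -a agent.log || echo "iteration $i exited non-zero; continuing"
done
```

Each iteration is a fresh non-interactive run, so all the continuity lives in the repo itself: the spec file, the git history, and whatever notes the agent leaves for its next incarnation.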
The total cost was $170. The logs show about 233 million input and 1.5 million output tokens. I stopped when it looked like all the priority work had been completed (for some value of “completed”) and I decided I didn’t want to spend any more money. I had planned to spend under $50, but there’s always that gambling feeling that spending a little bit more will result in a big payoff, so I went over budget. I used the API instead of a Claude Code plan: I didn’t want to hit a usage limit, and I also wasn’t sure if it was kosher to use a plan to run an agent in a loop and didn’t want to get banned.
The result is OK. It has all the features I asked for, including document sharing, collaborative editing in real time, support for fonts and line spacing, etc. etc. I could not have paid a developer $170 and gotten this. The problem, of course, is that, while abstractly impressive, this is completely useless, and I see no pathway for it to become useful with more effort.
There are lots of bugs or poor choices, depending on how you want to frame them. The whole web page scrolls instead of just the writing area. The table and image functionality is poor (tables and images can’t be positioned or resized, you can’t adjust the font in a table cell, …). Bullet points don’t really work. I wanted document-editor-style margin control and got bulk indent. It added collaboration but no account management or authorization.
We could add all these things, but that doesn’t solve the underlying problem that no taste is being applied. With a compiler or browser, taste is barely necessary, because the spec is there. With a UX-driven tool like a document editor, there’s no hiding in spec compliance; it’s easy to see whether the product is crap or not. And while academically I think it’s very cool that one can use AI to write a functional collaborative rich text editor in a day, this isn’t going to replace Google Docs anytime soon, and I see no evidence that would change if I were to spend 10x or 100x as much.
I anticipate criticism about the setup or the prompting. I’m sure that somewhere in the space of all prompts there is one that would have produced a better outcome; the question is how much time one must spend searching that space, and how much better the outcome would be. I see nothing to suggest that changing the prompt or architecture (say, agents with specialized roles like Cursor used) would bring about a step change that would put this on track to be a competitive piece of software.
Likewise, I could work on improving the spec. This, as already mentioned, would be whack-a-mole: fixing gaps in the current spec will just reveal new ones. Without the required taste in the first place, it’s a losing battle.
The proof is in the pudding, obviously, so if there are agentically coded projects of this nature with materially better outcomes, I would love to see them. Hopefully sharing this will help people who don't want to spend a few hundred dollars get a sense of what to expect.
It might not seem like it, but my original plan was to write a “look at this cool thing I built” article and highlight the good aspects of the project. It was only when I started writing that I realized how disappointed I was and changed tack. None of this means I’m dismissive of AI. There are people who, any time AI doesn’t meet the most hyped-up predictions, pounce and say things like “let’s be honest, generative AI isn’t going all that well”. I don’t want to give that impression. But it’s important to take a sober look at what the current state is, and how we can shore up weaknesses in the technology to get useful work out of it, instead of just jumping from demo to demo.