2026-03-15

Codegen is not productivity

There is a whole lot to say about generative AI. LLMs generate a bunch of code, this much is certainly true. Should we celebrate that? There is a long tradition of trying to measure software development output, and most of it tells us that lines of code is a poor metric of programmer productivity. I have some thoughts.

I have seen many people talk about the productivity they get from LLMs in terms of the code it generates for them. I have seen claims of 10,000 lines of code in a day or hundreds of thousands of lines in a week; these often seem like brags or at least they are presented positively. I do not believe that LLMs and generative AI change anything fundamental about using lines of code as a measure of output or productivity.

This is a rant. This is what I think about when I hear people talking about lines of code, whether generated by an LLM or pouring forth from human hands. I do not think anyone should celebrate code output.

It was never about writing code quickly

From the preface to the first edition of SICP:

First, we want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. Second, we believe that the essential material to be addressed by a subject at this level is not the syntax of particular programming-language constructs, nor clever algorithms for computing particular functions efficiently, nor even the mathematical analysis of algorithms and the foundations of computing, but rather the techniques used to control the intellectual complexity of large software systems.

In other---worse---words, programming is not about writing code that makes the computer do a specific thing, or at least not exclusively or primarily about that. Programming is an exercise in representing abstract ideas and managing complexity while doing that. Programming is as often an exploration of these things as it is an implementation of them.

I will note that none of the ideas below are new or original. I encourage you to check out the appendix of anecdotes and quotes for many takes on this. For just about as long as we have had programming languages, experts and more have argued that code should be thought of as a liability, not an asset; some of the anecdotes are about this and you can find many more online; this is a critical thing to keep in mind.

Code in programming

An average human could probably type about 4,000 lines of code a day. That said, developers do not spend all their time writing code. In fact, developers spend most of their time on activities other than coding.

LOC is a poor predictor of---and is poorly predicted by---other metrics of interest in software development, including defects, effort, and time.

This is important to understand: the generation of code is not the primary work of a programmer by time, nor does the amount of code predict anything useful about the software. And it is doubly important to realize that programmers are not the exclusive participants in the business of software development, whether that be internal solutions, software products for sale, or FLOSS. Code is one component of software development and problem solving, but it does not take the majority of time. Code is not the bottleneck; it never was.

And then, LLMs

The question comes up everywhere in discussion about generative AI. Some people seem to believe the answer is a firm yes. Others consider the very premise of the question to be absurd. Almost no one asks the question directly, but it is embedded in nearly every take on LLMs.

Should we abandon everything we know?

I do not believe the question is absurd to ask. I think it is absurd to implicitly answer the question without saying you are doing so. I believe everyone should explicitly consider it. I do not know that there is a correct answer. You must answer it, though, and know that you are doing so. And it is probably a good idea to share your answer aloud when you talk about AI, yourself.

Here is my take. Programming is still programming, even if the code is generated by an LLM. Much of the work of programming is not about literally getting code into a source file. Some of the work of programming is getting code into a source file. LLM codegen can accelerate the part of programming that is writing code.

There are other major components to programming which I am not ranting about here, so I will leave it to you to consider whether LLMs can help with those other components.

Problems with putting too much emphasis on code

LLMs are the primary generative AI technology being used for programming, so I focus on these.

LLMs constrain us to primarily text for planning, designing, and implementing. The tooling and norms push us to a markdown planning document and then implementation. This forces us into implementation too soon.

LLMs are trained to generate more output to solve a problem. That is what they are best at. Of course they can do otherwise, but their training and all agent harnesses emphasize generating output. It is an interesting philosophical note that even in reducing, an LLM must generate new tokens.

We get to generated assets and code far too quickly. Code is an incredibly high fidelity prototype, but it is expensive (even with LLMs) to change such a prototype. Who among us has not dealt with the pain of a POC pushed to production prematurely? LLMs encourage this! It is much easier to iterate at the design phase, but LLMs are limited in their ability to operate at that phase. The design iteration loop is not nearly as well supported by LLMs generating text, nor is it emphasized in a meaningful way in agent harnesses. Plan mode does not count.

There is huge value in low fidelity prototypes and designs. The value is psychological and practical. We are not attached to a scribble on a whiteboard, a sketch on some scrap paper, or seven circles and boxes we knocked together in Paint while talking. There is no weight to these things either; their very nature tells us they are disposable. There is no confusion that these things are expert-level creations nor any implicit gravitas. Generated artifacts, on the other hand---even if text-based such as ASCII art or a PlantUML diagram---feel more important and final, and like they are worth holding onto. LLMs confuse our well-honed heuristics about inherent quality in things that would take us longer to reproduce by hand, things that appear impressive on the surface. Plan documents and generated artifacts are too concrete: heavy and so much already set. LLMs rush us through design and promise an implementation now! This locks in too much too soon. The very medium that gives us flexibility also fools us and forces our hand.

And somehow, LLMs bring back the false belief that lines of code mean anything. There is one thing that a high line count guarantees: there are more lines of code that can be changed. Once we are in the code, whether directly or through an agent, we have left the realm of the fastest and easiest iterations, design. It is easier to wipe a line from a whiteboard or throw away a piece of paper than it is to change an implemented solution. Incidentally, it is also easier to do those things than to iterate the same ideas in a planning cycle with an agent.

LLMs entice us with code too quickly. We are easily led.

Maintenance and understand-ability; or, understand-ability

While LOC is not a good measure of productivity, it does have a direct impact on maintenance. It is hard to find numbers worth citing for the proportion of time that goes into software maintenance; it is easy to find numbers, but the studies have issues. Nevertheless, some searching indicates---and personal experience supports---that maintenance time comprises the majority of development time in software projects.

Humans and LLMs both share a fundamental limitation. Humans have a working memory, and LLMs have a context limit. The techniques to work with these limitations are quite similar. Nevertheless, no matter the technique, more source code is more difficult to deal with than less. There is more to understand. There are more places for things to interact. It is just plain easier to mess things up with more source code; you must be careful and meticulous.

Even if your inference provider of choice offers models with large context windows, context rot comes for all. Whether you like to anthropomorphize your technology or not, maintenance is an area where humans and LLMs benefit from the same things. One of those things is having fewer lines of code. It is good for all.

There is another consideration, as well. There is a common pattern I have seen in my own work and that of many others. Coding agents bring implementation close to hand, too close as I described above. This manifests a different problem: it is too easy to build bespoke solutions. That may sound like a benefit, and perhaps the entire point of coding agents to you. Let me explain further.

When it is so easy to start implementation, it is easy to forget to search for existing solutions. And I have observed in my own interactions with LLMs---and in others' implementations with agents---a distinct lack of sufficient push-back to use established packages and libraries. This is a compounding problem for understand-ability: the source size increases with implementations of things that could be library calls, and the custom implementation must be understood instead of simply referencing library documentation. This yields strictly more code than is necessary, and requires a closer reading of that code than if it used a well known package.

And this is not even considering the LLM-driven solutions that never even needed to be a software project in the first place.

It is worth considering productivity including maintenance, not just vibe-time to first implementation. And with the observation that agent-driven development leads disproportionately to build-over-"buy" decisions, we must consider whether unnecessary solutions delivered quickly count as productivity gains.

Collaboration; or, other humans

I contend that an increased volume of code and pace of change hurts collaboration. I admit that there may be counter-balancing benefits. It has been observed that code is read much more often than it is written, so it is probably worth optimizing for reading, where less is so much more.

LLMs seem mostly to be pitched as---and experience reports I have seen demonstrate---personal productivity enhancements. A big part of collaboration in programming is reading. After all, as observed above, code is primarily a medium for human understanding and incidentally for machine execution. We read each other's code continuously. Good software development practice demands that we peer review every line of code before shipping it. It does not matter how quickly code was generated when it comes time to read and review it. The speed that matters is humans' in review. Less code is more.

And no, Opus reviewing Gemini's code does not count; only when someone from your inference provider takes responsibility for your on-call shift do they get to own code review. We collaborate with one another on more than just getting code into source files. Let me be clear: if I am on the line for production up-time (and I am), then I am personally responsible for every single line of code that could affect that. It does not matter if I used an LLM to help or not; I am responsible for my contributions. And you are responsible for yours. If I wake up to downtime because you did not worry about reviewing what your LLM generated, I will not be happy. "The LLM said it was good," does not recover downtime, nor does it keep me sleeping like a baby.

I do not see much about collaboration from those spouting the gospel of LLM.

There is yet another aspect of collaboration un-addressed. I am engaged in regular dialogue with customers. Customers pay good money for products, and the work of providing products and incorporating feedback is collaboration. If I want customers' trust---and let us be blunt, their money---I must be able to make assertions about the product; simple things such as what it does, how it does it, what they can expect to work, what is coming in the future, and how to deal with any errors in the application. Customer support where I work flows either directly or pretty damn quickly to developers. If the code was written by an LLM without a human understanding it, then this support channel turns into chatbot support by another name: slower and more effortful, but chatbot support nonetheless. And I can tell you with certainty that customers are grateful to get capable humans when they need support. Customers paying good money deserve to get support from a human when they need it.

Conclusion

This is a rant; there is no conclusion.

Maybe ask some questions:

what do LLMs provide?
(how) should productivity be measured?
do LLMs improve this measure?
what is the cost of an LLM? (Nota bene: the answer to this one should not be in any denomination of currency)
is the value of the LLM worth the cost?

Gratitude

Big thanks to my test readers, Johnny Winter, Gilbert Quevauvilliers, Eugene Meidinger, Bernat Agulló, Daniil Maslyuk, Daniel Marsh-Patrick, and Alex Barbeau. Any errors are, of course, my own.

You should read all of these things

Some may take longer than others.

Typing is not a programming bottleneck
On the cruelty of really teaching computer science, or if you prefer the superior experience: in the man's own writing
Rethinking Productivity in Software Engineering
Structure and Interpretation of Computer Programs

Appendix one

These are all appeals to authority. That is why they are in an appendix, and not part of the rant itself. Most of these are authorities that you should listen to, though.

"My point today is that, if we wish to count lines of code, we should not regard them as 'lines produced' but as 'lines spent': the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.": Dijkstra
"One of my most productive days was throwing away 1,000 lines of code": Ken Thompson
"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?": Kernighan's law
"Basically, lines-of-code is a completely bogus metric for anything": Linus Torvalds
-2000 Lines of Code
"Measuring programming progress by lines of code is like measuring aircraft building progress by weight": Bill Gates
"I hate code, and I want as little of it as possible in our product": Jack Diederich: Stop Writing Classes
"The most valuable tools in an AI-assisted workflow aren’t the ones that generate the most code, but the ones that constrain it correctly": Anders Hejlsberg
What a programmer does

And a recurring theme in my analytics consulting work:

I often get called in for gorgeously gnarly questions of performance or correct behavior, or both. I typically work on the things that others have tried and failed at, repeatedly. I spend so much more time asking questions such as, "Why," "What is that," and, "Why is it that," than I ever spend writing code. Invariably, after asking such questions for hours, the number of lines of code I have had to write is measured in single digits. I often get to delete code counted in tens or hundreds of lines, and in a couple of glorious cases, thousands of lines in a single measure. More often than writing new implementation code, the solution is to add a dimension or a data structure, rather than write any new implementation code.

The hard problems I find in consulting and in programming generally are not questions of implementation, rather they are questions of questions. A well formulated question is often already halfway answered. Understanding the problem domain well enough to know what questions need answers is the hard part. Forming questions and requirements well is the hard part. Writing the code is often a formality once the domain is well understood, the questions are well formed, and the requirements are clear.

Appendix: how I use LLMs

Though this is irrelevant to the article, I cannot help but presume many people will care about this. My experience and my work have nothing to do with the content above.

I use generative AI daily in my work and have for the better part of a year, shipping major new features to multiple producs. My job is to deliver software products and systems that other people use. Someone has to operate, maintain, and extend those software products; often that someone is me. I have unlimited access to frontier models from major labs and many open weight models. I consider it important to share that I no longer have any relationship with OpenAI, nor will I again. I have Claude Code and other harnesses. This is not to brag. I am lucky to work in organizations that pass Joel's test.

This is not a rant bred in the brain of a Luddite. I have exposure to and experience with the subjects I discuss.

My use of LLMs can be broken down into four major categories:

Category	Approximate quantity of tasks	Approximate time spent in use case
rubber duck planning/design	35	15
better search	25	5
digest docs for examples	25	5
generating code: edit or new	15	75

Rubber duck planning and design is everything you would expect from spec-driven development with agent harnesses. I also use it for exploring architectural ideas, researching available libraries and prior art in a given space.

Better search is pretty self explanatory. Given some project context, LLMs can do a quite good job of being a little research helper. I have found that it is necessary to be very explicit about preferred sources, and that I need to ask for direct links and citations. For anything beyond the most trivial, I mostly use the LLM to find things for me to read, rather than use it to digest or interpret.

Digesting docs for examples is primarily when I am using some new library or a construct I am not familiar with. It is helpful to get contextual examples of how something would fit into my existing code. Usually LLMs are pretty good at basic examples. I have found that nuances of semantics for languages other than C# and Python often escape every LLM, even when reading language docs.

Generating code is exactly what it sounds like. And boy howdy, can these things put text into a source file! I have absolutely no trouble believing people who claim that they generate 10,000 lines of code in a day with an LLM. The problem is this: I want absolutely nothing to do with the code that these things generate, or at least not without a massive number of improvements.

Here is a quick list of the various things I need to fix all the time in LLM-generated code:

copying code rather than reusing existing abstractions
deleting tests or rewriting them to not test the right thing
misunderstanding dependencies and invariants
writing implementations that hard-code test cases and return only the tested values
preserving test cases for code that has been deleted, trying to make them pass
design decisions that are routinely the opposite of what I would make, even with specs and context built up over days
an absolute incapability to identify opportunities for abstraction and well-typed solutions without exact prompting for the abstraction to use
failure to build interfaces that are consistent and reasonable:
- inconsistent argument ordering
- inconsistent naming conventions with other parts of the code base
- routinely writing code that can only be called if you already know how it is written
not following my guidance on style, approach, architecture
not following authoritative guidance on how to use a library
implementing shitty facsimiles where they should use a battle-tested library
using a whole framework where they could write a single short function
outright refusal to build upon established abstractions, instead glomming new code on, rather than using or extending core components of a solution

These are not specific to any LLM, harness, or environment. These issues have come up for me with C#, F#, elisp, DAX, M, Bash, Ruby, Python, and OCaml. These issues occur in domains ranging from dimensional modeling, data engineering, parsers, compilers, writing CLIs, ASP.NET, general web programming, basic system automation, and building system daemons. An LLM can give a perfectly lucid analysis of an architecture, library, individual function, or data structure; it can identify extension points and places that might need improvement; it can find limits and give a detailed analysis of how to integrate new functionality. After they do all that, they end up doing something that looks like this.

I will note that all of these failure modes end up yielding more code than is necessary. All of my fighting with LLMs is to get them to write less code.

When I deal with LLM-generated code, I have a bad time.

I am mostly frustrated when using LLMs for code generation. It often feels futile to get an LLM to generate code that I would accept as a PR reviewer. I constantly question whether they are a useful tool for the programming I do.

I have to tell you, the value I get from LLMs does not come in any way from getting code into a source file faster.