Source: Does Token Oversupply Change? — DBAI class discussion
One of the more practically useful class discussions this semester was about the token oversupply claim: that only 15–18% of available AI inference capacity is being used. If true, this has enormous implications for GPU pricing, data centre economics, and the AI infrastructure investment thesis.
The methodology behind the claim deserves scrutiny. It assumes that all 9–10.5 million H100-equivalent GPUs globally run inference workloads 24/7 at a theoretical peak throughput of 1,400 tokens per second. In reality, a large fraction of those GPUs is used for training, fine-tuning, and R&D, and real-world inference throughput varies enormously with model size, batch size, and context length. The consumption figure also likely undercounts internal enterprise inference, research usage, and non-API tokens. Each of these corrections pushes true utilisation above the headline number.
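To see how sensitive the headline number is to those assumptions, here is a rough back-of-the-envelope sketch. The GPU count, peak throughput, and 15–18% range come from the claim itself; the 1,400 tokens per second is read here as a per-GPU figure. The `inference_share` and `throughput_factor` values are purely hypothetical placeholders chosen for illustration, not measured data, and the function names are my own.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_capacity(gpus, tokens_per_sec, inference_share=1.0, throughput_factor=1.0):
    """Tokens per day the fleet could serve under the given assumptions.

    inference_share:   fraction of GPUs actually serving inference
    throughput_factor: real-world throughput as a fraction of theoretical peak
    """
    return gpus * inference_share * tokens_per_sec * throughput_factor * SECONDS_PER_DAY


def utilisation(consumed_per_day, capacity_per_day):
    """Share of capacity that is actually consumed."""
    return consumed_per_day / capacity_per_day


# Figures taken from the claim as discussed in class.
GPUS = 10_000_000             # within the quoted 9-10.5 million range
PEAK_TPS = 1_400              # theoretical peak, assumed to be per GPU
HEADLINE_UTILISATION = 0.165  # midpoint of the quoted 15-18%

# The claim's own denominator: every GPU on inference, 24/7, at peak.
naive = daily_capacity(GPUS, PEAK_TPS)

# Back out the token consumption implied by the headline figure.
implied_consumption = HEADLINE_UTILISATION * naive

# HYPOTHETICAL adjustments, for illustration only: half the fleet serves
# inference, and real throughput averages 60% of theoretical peak.
adjusted = daily_capacity(GPUS, PEAK_TPS, inference_share=0.5, throughput_factor=0.6)

print(f"Headline utilisation: {utilisation(implied_consumption, naive):.1%}")
print(f"Adjusted utilisation: {utilisation(implied_consumption, adjusted):.1%}")
```

Under these illustrative inputs, the same implied consumption corresponds to roughly 55% utilisation rather than 16.5%, and that is before correcting for any undercounted consumption. The specific result matters less than the point that the headline figure moves by a factor of three or more with modest changes to the denominator.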
The deeper question is whether any level of current utilisation tells you anything meaningful about future demand. The history of compute markets suggests it probably doesn’t. Mainframe utilisation looked low before personal computing arrived. Bandwidth utilisation looked low before streaming video arrived. The question isn’t whether today’s capacity is fully used; it’s whether the applications that will consume it exist yet.
My intuition, based on the agentic AI trajectory, is that the demand side is just getting started. Investors and operators who mistake short-term excess capacity for long-term structural oversupply will make predictably bad decisions.