Source: Wang (2025), “F(r)iction in Machines: Accounting Hallucinations of Large Language Models”
The hallucination problem in large language models is well known in general terms. This paper maps it specifically onto accounting information, querying LLMs about the financial statements of SEC-reporting firms, and the results are more nuanced and more concerning than the general framing suggests.
The researcher identifies two distinct types of accounting hallucination. The first is deviation: the model returns a value for an item that does exist in the financial statements, but the value is wrong (for example, the 2021 revenue figure is reported incorrectly). The second is fabrication: the model returns a number for a financial item or period that does not appear in the reported financials at all, inventing a figure rather than retrieving or misreporting a real one.
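The two definitions above can be sketched as a small labelling function. This is only an illustration: the function name, the label strings, and the 0.5% relative tolerance are assumptions made here, not the paper's actual procedure. A reported value of `None` stands for an item or period that is absent from the filings.

```python
import math
from typing import Optional


def classify_answer(
    reported: Optional[float],
    answered: Optional[float],
    rel_tol: float = 0.005,  # assumed tolerance for rounding differences
) -> str:
    """Label one model answer against the filed financials.

    reported: value in the filing, or None if the item/period does not exist.
    answered: value the model returned, or None if it declined to answer.
    """
    if answered is None:
        # The model gave no number; whether that is right depends on the filing.
        return "abstained" if reported is not None else "correct_abstention"
    if reported is None:
        # A number was produced for something that was never reported.
        return "fabrication"
    if math.isclose(answered, reported, rel_tol=rel_tol):
        return "correct"
    # The item exists, but the returned value is wrong.
    return "deviation"
```

Under these assumptions, `classify_answer(100.0, 120.0)` is labelled a deviation, while `classify_answer(None, 50.0)` is labelled a fabrication.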
The counterintuitive finding: more accounting information in the public domain leads to fewer deviations but more fabrications. When a firm has more historical filings and more media coverage, LLMs are less likely to get the real numbers wrong — they’ve seen them more often in training. But the same familiarity seems to enable more confident fabrication of nonexistent items, because the model has enough context about the firm to construct plausible-sounding but fictional numbers.
This distinction matters enormously for practitioners. A deviation error at least concerns a real figure: the model is trying to retrieve a reported number and gets it wrong. Fabrication errors are categorically different: the model generates a number with no real-world referent. An auditor or investor relying on LLM output for financial data faces both risks simultaneously.
The paper also examines remedies. Web search and file upload — the most common approaches to grounding LLM outputs — are inadequate for addressing the problem. Web search may retrieve the same incorrect information the model was trained on. File upload helps for deviation errors but doesn’t fully eliminate fabrication.
The framework this paper develops for quantifying both types of hallucination is a useful diagnostic tool. Any organisation deploying LLMs for accounting or financial data tasks needs to characterise both error types in its specific use case. “The model usually gets it right” is not sufficient validation given the consequences of fabrication errors in financial contexts.
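As a rough sketch of what characterising both error types might look like in practice, the snippet below computes a deviation rate over items that exist in the filings and a fabrication rate over items that do not. The rate definitions, the function name, and the tolerance are assumptions for illustration; this summary does not describe the paper's own measures.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

# (reported value or None if the item does not exist, model answer or None)
Pair = Tuple[Optional[float], Optional[float]]


def hallucination_rates(
    pairs: Sequence[Pair],
    rel_tol: float = 0.005,  # assumed tolerance for rounding differences
) -> Dict[str, Optional[float]]:
    """Return assumed deviation and fabrication rates for a test set."""
    existing = [(r, a) for r, a in pairs if r is not None]
    missing = [(r, a) for r, a in pairs if r is None]

    # Deviation: the item exists, the model answered, and the value is off.
    deviations = sum(
        1
        for r, a in existing
        if a is not None and not math.isclose(a, r, rel_tol=rel_tol)
    )
    # Fabrication: the item does not exist, yet the model produced a number.
    fabrications = sum(1 for _, a in missing if a is not None)

    return {
        "deviation_rate": deviations / len(existing) if existing else None,
        "fabrication_rate": fabrications / len(missing) if missing else None,
    }


if __name__ == "__main__":
    sample = [(100.0, 100.0), (100.0, 120.0), (None, 50.0), (None, None)]
    print(hallucination_rates(sample))
```

Measuring fabrication requires deliberately including queries about items or periods that were never reported, since a test set built only from real figures can only reveal deviations.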