[An aside: I went to a presentation at the Hong Kong Society of Financial Analysts recently to hear about ‘Agentic AI’. The talk was given by a Microsoft representative who was, naturally, praising ‘Copilot’. The audience, judging by their questions, could barely suppress their frustration with both AI in general and Copilot in particular. Remember, these are people at the sharp end of the business, and if they’re struggling… Ms. Microsoft could only offer ‘Iterate, iterate, iterate’ as the best way to make AI more useful as a research tool. ‘How about getting the answer right in the first place?’ would have been my pushback; however, I thought it best to keep that to myself.]
Albert D. Wang of the University of Texas contributes to the small but growing literature on how LLMs fabricate data, and on why financial analysts must treat both the models and their output with extreme caution.
‘Hallucinations’ are the big problem, and where accounting data is concerned they come in two varieties. The first is deviations, i.e. wrong answers. The second is fabrications, where the machine simply makes up numbers.
Here’s the bombshell. In this study, even after using the best prompts, the researcher found 48% deviations and 36% fabrications. Take a moment to think about what that means in a real-world context.
It gets worse. Using the established Retrieval-Augmented Generation (RAG) approach, i.e. simply running a web search or uploading documents for the LLM to parse, the web-search results still came out with a 22% deviation rate and the file uploads produced 7% deviations. This isn’t even garbage in, garbage out. This is good data in, garbage out.
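For readers who haven’t met the term, here is a minimal sketch of the RAG idea: fetch relevant text first, then force the model to answer from it rather than from memory. Everything in it, the toy keyword retriever, the scoring, the prompt template and the sample filings, is an illustrative assumption on my part, not the paper’s setup.

```python
# Minimal, illustrative RAG sketch (assumed structure, not the paper's setup):
# retrieve the most relevant document chunks, then ground the LLM's prompt
# in them instead of relying on whatever the model memorised during training.

def score(query: str, chunk: str) -> int:
    """Crude relevance score: how many query words appear in the chunk."""
    words = set(query.lower().split())
    return sum(1 for w in words if w in chunk.lower())

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Stuff the retrieved text into the prompt so the model answers from it."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return (f"Answer using ONLY the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}")

filings = [
    "FY2023 10-K: total assets of $12.4bn, total liabilities of $7.1bn.",
    "FY2023 10-K: revenue of $5.9bn and net income of $0.8bn.",
    "Press release: the company opened a new office in Austin.",
]
print(build_prompt("What were total assets in FY2023?", filings))
```

The striking point of the study is that even with the answer handed to the model this way, the deviations didn’t drop to zero.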
The problem is partly the way LLMs work. If they haven’t been trained on the data, they will just go for the most probable next token. Fine for language, but not for accounting data.
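A toy illustration of that mechanism (the candidate tokens and probabilities below are invented for the example): the model holds no lookup table of facts, it just emits the likeliest continuation, which reads as fluent and confident whether or not it is right.

```python
# Toy next-token prediction (probabilities invented for illustration).
# Completing "Acme Corp's FY2023 revenue was $", the model doesn't look
# anything up; it simply emits the highest-probability continuation.
next_token_probs = {
    "5.9bn": 0.24,   # the true figure, IF it was ever in the training data
    "6.2bn": 0.31,   # a plausible-sounding neighbour
    "4.8bn": 0.27,
    "unknown": 0.18, # models rarely favour admitting ignorance
}
prediction = max(next_token_probs, key=next_token_probs.get)
print(f"Acme Corp's FY2023 revenue was ${prediction}")
# -> $6.2bn: fluent, confident, wrong
```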
Training cut-off dates are also a factor, and it is all made worse by the fact that the big LLM providers won’t open their black boxes, so analysts can’t dig into how the errors are systematized and thereby try to compensate.
Perhaps most damning of all in this work is what the researcher was trying to isolate. If it were some complex interrelationship of accounting variables over time, I’d have some sympathy with the LLMs’ wobbliness. Instead, all that was being asked for was data on total assets, total liabilities, revenue and net income. It doesn’t get any more basic.
The final word goes to the researcher: “These empirical findings suggest that LLMs exhibit uneven knowledge across firms and over time, which raises important implications for their use in financial decision-making and research.” D’uh!
I have no use for AI in my analysis other than as a search-on-steroids tool. The inability to generate, consistently, the right answer to the most basic inquiry is something one wouldn’t tolerate in the most junior of analysts.
Perhaps I just need to ‘iterate’ a bit harder?
You can review the paper in full via the following link: F(r)iction in Machines: Accounting Hallucinations of Large Language Models.
Happy Sunday