HELL-OF-A-NATION
I was just writing about a new study on the universality of AI sycophancy and the problems it causes. Another similar study, this one examining AI hallucinations, was posted last week to the Computer Science section of arXiv. The results are just as staggering.

The researcher was considering the persistent problem of large language models fabricating their own facts, paired with the reality that "the most common and critical applications of LLMs in the enterprise is answering questions grounded in provided documents" and that such tasks require these tools to do exactly that: "given a set of documents, answer questions accurately based on what is in them — and only what is in them."
So, in the real world, how big is the problem of LLMs inventing their own fraudulent data or misattributing information across documents? The researcher tested 35 open-weight LLMs across seven model families (including DeepSeek, GLM, Granite, Llama, MiniMax, and Qwen), using both NVIDIA and AMD hardware and looking at many billions of tokens from real document questions.
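To make the task concrete: the study's setup is grounded document QA, where every claim in an answer should be traceable to the provided documents. The paper's actual evaluation method isn't described here, but a crude, minimal sketch of the idea is to flag answer sentences whose content words barely overlap with the source text. The function names, threshold, and word-overlap heuristic below are all illustrative assumptions, not the researcher's method.

```python
# Illustrative sketch only: a naive word-overlap check for "unsupported"
# answer sentences in grounded document QA. Real evaluations use far more
# sophisticated entailment or claim-verification methods.

def support_score(sentence: str, document: str) -> float:
    """Fraction of the sentence's words found in the document (crude proxy)."""
    doc_words = {w.strip(".,!?").lower() for w in document.split()}
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if not words:
        return 1.0
    return sum(w in doc_words for w in words) / len(words)

def flag_unsupported(answer: str, document: str, threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose word overlap with the document is low."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return [s for s in sentences if support_score(s, document) < threshold]

doc = "The contract was signed in 2021 and renews annually in March."
answer = "The contract was signed in 2021. It was cancelled by the vendor in 2023."
print(flag_unsupported(answer, doc))  # → ['It was cancelled by the vendor in 2023']
```

A heuristic like this would miss paraphrased fabrications and flag legitimate rewording, which is part of why measuring hallucination rigorously at scale, as this study attempts, is genuinely hard.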
They found hardware was not a relevant factor, and that no amount of hardware optimization or temperature tuning changed the results or closed the gap between the best and worst models. Perhaps unsurprisingly, the research also found that the longer and more complicated the question, the worse the models performed.
Still, even the best-performing, least hallucinatory models reliably fabricated answers more than 1% of the time with the shortest (32k) contexts, a shocking rate for systems intended to be reliable and objective. And that's the absolute best possible case, under ideal settings that exist in few real-world deployments. In more typical scenarios, even what are still considered top-tier models fabricated results for these briefest queries at rates between 5% and 7%, while median models did so 25% of the time. Yikes! But that's the good bit.
With medium contexts (128k), the lowest rate of hallucination tripled to above 3%, with only five of 26 tested models able to stay below 10% noisy nonsense. At their worst, dealing with 200k contexts, no model stayed below 10%. Across all context lengths (32k, 128k, and 200k), the best models hallucinated 11% of the time, most landed between 36% and 52%, and the worst gifted users a fabrication rate of 72%.
Also problematic: a model deemed excellent at mining a document for relevant information cannot necessarily avoid pitching you fiction. The two abilities are not reliably correlated.
My takeaway is that you should not do as the model-makers suggest and "just give it your documents." No thanks.