
COUNTING KIWIS or WHY YOU SHOULDN'T OUTSOURCE YOUR THINKING

Mar 30

Researchers at Apple recently showed that frontier AI models, the newer specialized variants explicitly designed for more complex reasoning tasks (from OpenAI, Google, DeepSeek, Meta, Anthropic, and others), are unable to do basic math. The way they demonstrated this would look devastating if I were someone arguing for the coming tech revolution and the utopia it promises to usher in.



[image: miscounting kiwis]


The team used GSM8K, a set of grade-school math problems and a favoured AI math benchmark. They set up a puzzle environment in which they could conduct "analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs [large reasoning models] think." And their findings? That reasoning models do not use explicit algorithms to arrive at a result, nor do they reason consistently across questions of different scale or complexity.


The study took the same GSM8K problems, testing the same reasoning skills, and simply changed the numbers presented. That's the basic "can they do math" side of this. The result was a significant drop in performance, relative to the original questions, for every one of the two dozen cutting-edge models tested. In their results section the team highlights, most notably and astonishingly, that "when we provided the solution algorithm of puzzle to the models, their performance on this puzzle did not improve." They can't do math.
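
To make that setup concrete, here is a minimal sketch of the kind of perturbation described: a problem becomes a template whose wording, and therefore whose required reasoning, stays fixed, while the numbers are re-rolled and the ground truth recomputed. The template and function names here are mine, for illustration; this is not the paper's actual harness.

    import random

    # A GSM8K-style problem as a template: the wording and required
    # reasoning stay fixed; only the numbers vary between variants.
    TEMPLATE = (
        "Oliver picks {fri} kiwis on Friday. Then he picks {sat} kiwis on "
        "Saturday. On Sunday, he picks double the number of kiwis he did on "
        "Friday. How many kiwis does Oliver have?"
    )

    def make_variant(rng):
        """Return one instantiated question and its recomputed ground truth."""
        fri = rng.randint(10, 99)
        sat = rng.randint(10, 99)
        question = TEMPLATE.format(fri=fri, sat=sat)
        answer = fri + sat + 2 * fri  # Friday + Saturday + double Friday
        return question, answer

    rng = random.Random(0)
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question, "->", answer)

A model that actually performs the arithmetic should score the same on every variant; a model that has memorized the surface form of the original benchmark questions will not. That gap is what the study measured.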


From there, the researchers added a single sentence to the math problems to test the models' reasoning. That one sentence introduced an irrelevant detail, nothing to do with the math, simply to see whether any of the state-of-the-art models could tell signal from noise.


The query looked like this: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"


Simple. We all saw this kind of thing in primary school. Of course, the size of the kiwis has nothing to do with their number, and the correct result is 190 (44 on Friday, 58 on Saturday, 88 on Sunday).
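
The injection itself is mechanical. Here is a sketch of how an irrelevant clause can be spliced into an otherwise-solved question; the clause text comes from the study's example, while the variable names and splicing helper are mine:

    # The original question and the same question with one irrelevant
    # clause spliced in. The ground truth is identical for both.
    CLEAN = (
        "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
        "On Sunday, he picks double the number of kiwis he did on Friday. "
        "How many kiwis does Oliver have?"
    )
    NOOP = CLEAN.replace(
        "he did on Friday.",
        "he did on Friday, but five of them were a bit smaller than average.",
    )

    answer = 44 + 58 + 2 * 44   # 190 either way; kiwi size is irrelevant
    models_answer = answer - 5  # 185: the subtraction most models performed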


What did OpenAI's reasoning model o1-mini, the one recently incorporated into Microsoft Copilot and deployed to governments and universities across the Western world, do with this? It subtracted the five smaller kiwis and arrived at 185. How about Meta's Llama? It did the same thing. Most other models were on the same page. GPT-4o, Phi-3-mini, and o1-preview (OpenAI's most advanced reasoning model) saw their so-called "math ability" drop by half from the addition of one irrelevant phrase. Even when the researchers gave the models correct examples of the exact same question beforehand, with the right solution shown in each instance, their ability to reason seldom improved, and when it did, not by much.
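
That last manipulation is ordinary few-shot prompting: solved instances of the very same question are prepended to the prompt before the model is asked again. A minimal sketch of how such a prompt is assembled (the exemplar text and helper are illustrative, not the paper's code):

    # Few-shot prompt: solved exemplars are prepended so the model can,
    # in principle, simply copy the demonstrated procedure.
    EXEMPLARS = [
        ("Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
         "On Sunday, he picks double the number of kiwis he did on Friday, but "
         "five of them were a bit smaller than average. How many kiwis does "
         "Oliver have?",
         "The size of five kiwis does not change the count: 44 + 58 + 88 = 190."),
    ]

    def build_prompt(question):
        """Prepend solved Q/A pairs to the question being asked."""
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXEMPLARS)
        return f"{shots}\n\nQ: {question}\nA:"

Even with the correct procedure sitting directly above the question, the models' accuracy barely moved.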


The models' "reasoning" was not to work through the problem in front of them or even to prioritize what the question actually asked. Instead, they spotted a stray number, five, read it as a critical part of the problem, and matched it to a pattern from similar questions that genuinely require a subtraction. They weighted their own wrong interpretation heavily and went all-in on it, demonstrating for the researchers that they do not have a first-grader's understanding of what addition and subtraction are, only an unreliable capacity for pattern-matching.


This was the result with a simple problem. And when the number of steps in a problem was increased, performance didn't merely drop; the rate of failure accelerated. The more thinking required, the harder the models collapse. Rather than working harder on a problem as complexity increases, these systems simply crash harder. However, as you may have noticed, what never happens in real-world use is the model recognizing that it failed, never mind admitting as much. What it does instead is confidently offer you a result it insists is logical and based on the best information available. And if you catch the error and question it, the system will explain the mistake away as simple and obvious and one it had a good reason for. It always has what it frames as a good reason. (If you probe, it will likely give you a reason that makes no sense or is otherwise unverifiable. And regardless, it will make the exact same error in the same way in the future...)


These are the tools being deployed to create law-enforcement and military capture-and-kill lists. These are the systems being used by people experiencing mental-health crises or marital trouble, seeking legal or financial help, looking for homework or tax-filing assistance, or solving critical problems in the workplace. Sure, there are times and places to use these tools, but never when the results matter in the least.



