Why arithmetic models look dumb long after they’ve learned the rule

An experiment in memorization, grokking, and misleading loss curves

This post documents an experiment that didn’t go the way I expected.
What started as a simple attempt to observe memorization and grokking in arithmetic models turned into a deeper lesson about how misleading loss curves can be — especially for algorithmic tasks.


What I expected to see

I went into this experiment with a fairly standard mental model of how learning unfolds in neural networks, particularly for structured, rule-based problems.

The story usually goes like this:

  • Training loss collapses first, as the model memorizes patterns in the training data.
  • Evaluation loss remains high, because the model fails on unseen examples.
  • After sufficient training, evaluation loss suddenly drops, signaling grokking — the moment the model discovers the underlying rule and generalizes.

Arithmetic felt like an ideal testbed for observing this behavior. The rules are precise. The outputs are unambiguous. There’s nowhere to hide behind semantics or subjective interpretation.

I expected the training loss curve to tell a clean, familiar story.

That’s not what happened.


The experiment (briefly)

I fine-tuned a small transformer (DistilGPT-2) on synthetic arithmetic tasks:

  • Operations: addition, subtraction, multiplication, division
  • Inputs: structured text (e.g., DIVIDE 845 79)
  • Outputs: exact numerical answers
  • Training setup:
    • Constant learning rate
    • Weight decay
    • Long-horizon training (500+ epochs)
    • Standard cross-entropy loss

The goal wasn’t performance or benchmarking. It was to observe learning dynamics in a controlled setting.
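To make the setup concrete, here is a minimal sketch of the kind of fine-tuning loop this implies. The data generator, hyperparameters, and batch size are illustrative assumptions, not the exact configuration used in the experiment.

```python
# Minimal sketch of the training setup (assumed data format and
# hyperparameters are illustrative, not the exact values used here).
import random
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

OPS = {
    "SUM": lambda a, b: a + b,
    "SUBTRACT": lambda a, b: a - b,
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: round(a / b, 8),
}

def make_example() -> str:
    """One synthetic training line, e.g. 'DIVIDE 845 79 = 10.69620253'."""
    op = random.choice(list(OPS))
    a, b = random.randint(1, 999), random.randint(1, 999)
    return f"{op} {a} {b} = {OPS[op](a, b)}"

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Constant learning rate + weight decay, no scheduler.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

train_texts = [make_example() for _ in range(2000)]
loader = DataLoader(train_texts, batch_size=16, shuffle=True)

model.train()
for epoch in range(500):                    # long-horizon training
    for batch in loader:                    # batch is a list of strings
        enc = tokenizer(list(batch), return_tensors="pt", padding=True)
        # Standard causal-LM cross-entropy; padding positions are masked out.
        labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
        loss = model(**enc, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The properties that matter for the rest of this post are the constant learning rate, the weight decay, and the plain token-level cross-entropy objective.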


The first surprise: loss curves didn’t behave as expected

[Figure: Training and evaluation loss over long-horizon fine-tuning. Evaluation loss improves dramatically while training loss declines slowly, challenging the usual interpretation of memorization preceding generalization.]

What I saw early on was confusing:

  • Training loss decreased slowly and stubbornly.
  • Evaluation loss dropped dramatically — early and consistently.
  • Gradients were stable, learning rate constant, and training entirely well-behaved.

At first glance, this didn’t make sense.

If training loss represents memorization, and evaluation loss represents generalization, how could the model be generalizing before it memorized?

The loss curves looked contradictory.
But they weren’t unstable or noisy — they were calmly telling a story I didn’t yet understand.


When loss curves stopped explaining what was happening

Trying to reason purely from the curves got me nowhere.

So I stopped staring at the loss and started looking directly at the model’s outputs.

That’s where things finally clicked.


Looking at predictions changed everything

Operation | Input     | Expected Output | Model Output | Observation
DIVIDE    | 845 ÷ 79  | 10.69620253     | 10.69620253  | Correct to several decimals
DIVIDE    | 356 ÷ 936 | 0.38034188      | 0.3165618    | Numerically close, wrong digits
SUM       | 568 + 390 | 958             | 958          | Exact match
MULTIPLY  | 974 × 915 | 891210          | 8912101025   | Correct structure, failed termination
SUBTRACT  | 190 − 498 | -308            | -308781889   | Correct sign, magnitude explosion

Many failures reflect partial algorithmic correctness rather than random guessing — behavior invisible to token-level loss.

Once I inspected individual predictions, clear patterns emerged.

The model was often:

  • Applying the correct operation
  • Producing values with the right order of magnitude
  • Generating numerically close answers, especially for division

But the answers were still wrong.

And in arithmetic, close is indistinguishable from incorrect.

One extra digit.
One misplaced decimal.
One failure to stop generation at the right time.

From the perspective of cross-entropy loss, these are complete failures.

Behaviorally, they tell a very different story.

The model almost always applies the correct operation, even when the answer is wrong.


Arithmetic is unforgiving — and loss reflects that

Arithmetic exposes a fundamental weakness in token-level loss metrics:

  • Partial correctness is invisible.
  • Near-correct answers receive the same penalty as wildly incorrect ones.
  • One wrong digit invalidates the entire sequence.

This is especially severe for division. Predicting 0.316 instead of 0.380 is numerically close, but token-wise it’s treated as fully wrong.
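A tiny pure-Python sketch makes the gap concrete, reusing the division row from the table above (the comparison itself is illustrative, not the experiment's evaluation code):

```python
# Rough sketch contrasting a numeric view with an exact-match / token view
# of the same prediction (pure Python; not the experiment's actual code).
expected, predicted = "0.38034188", "0.3165618"

# Numeric view: the prediction is off by roughly 17%.
rel_err = abs(float(predicted) - float(expected)) / abs(float(expected))

# Exact-match view (what cross-entropy effectively rewards): any wrong
# digit makes the whole answer count as a failure.
exact = predicted == expected

print(f"relative error: {rel_err:.2%}")   # ~16.8%
print(f"exact match:    {exact}")         # False, same score as a wild guess
```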

This leads to an important realization:

The model learned the rule long before it learned precision.

Loss does not acknowledge that.


A reframing that made things click

At some point, I wrote this down while reviewing outputs:

“I know how to compute, but I haven’t learned the termination constraint.”

That single sentence explained many of the failures I was seeing:

  • Correct digit-level computation
  • Correct sign and scale
  • Failure to stop at the right point
  • Extra digits appended to otherwise correct answers

This isn’t random guessing.
It’s incomplete algorithm execution.
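One way to make this concrete is a rough failure tagger like the sketch below. The categories and the 20% closeness threshold are illustrative choices for this post, not a formal taxonomy from the experiment:

```python
# Illustrative failure-mode tagger; categories mirror the table above.
def classify(expected: str, predicted: str) -> str:
    if predicted == expected:
        return "exact"
    if predicted.startswith(expected):
        # Right digits, then extra ones: the termination constraint failed.
        return "correct value, failed termination"
    try:
        e, p = float(expected), float(predicted)
    except ValueError:
        return "malformed output"
    same_sign = (e < 0) == (p < 0)
    if same_sign and abs(p - e) / max(abs(e), 1e-9) < 0.2:
        return "numerically close"
    if same_sign:
        return "correct sign, wrong magnitude"
    return "other"

print(classify("891210", "8912101025"))     # correct value, failed termination
print(classify("0.38034188", "0.3165618"))  # numerically close
```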

Which leads to the core insight of this experiment:

The model didn’t fail to learn arithmetic — our metrics failed to notice when it did.


Loss vs learning

This experiment forced me to rethink what loss curves are actually measuring.

The distinction that matters is this:

Grokking is about representations.
Loss is about tokens.

For arithmetic tasks, these two can drift far apart.

That’s why evaluation loss can improve early, training loss can remain high, and the model can look “dumb” long after it has learned the underlying rule.

Put more bluntly:

Loss curves can lie badly for algorithmic tasks.


The wrong question — and the right one

Early on, I kept implicitly asking:

Is the answer exactly right?

But that turned out to be the wrong question.

The more informative question is:

Is the model executing the correct procedure?

Once I started evaluating predictions with that lens — including decimal-aware comparisons for division — the behavior made sense.
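Here is a sketch of what such a procedure-level check might look like. The prompt parsing and the tolerances are assumptions, not the experiment's exact evaluation code:

```python
# One possible procedure-level check (a sketch; tolerance values and the
# prompt format are assumptions for illustration).
import math

def true_answer(prompt: str) -> float:
    op, a, b = prompt.split()
    a, b = float(a), float(b)
    return {"SUM": a + b, "SUBTRACT": a - b,
            "MULTIPLY": a * b, "DIVIDE": a / b}[op]

def procedure_correct(prompt: str, predicted: str, rel_tol: float = 1e-2) -> bool:
    """True if the prediction is numerically close to the right answer,
    i.e. the model appears to have applied the correct operation."""
    try:
        return math.isclose(float(predicted), true_answer(prompt), rel_tol=rel_tol)
    except (ValueError, ZeroDivisionError, KeyError):
        return False

print(procedure_correct("DIVIDE 845 79", "10.69620253"))              # True even at a tight tolerance
print(procedure_correct("DIVIDE 356 936", "0.3165618"))               # False at 1%...
print(procedure_correct("DIVIDE 356 936", "0.3165618", rel_tol=0.2))  # ...True if we only ask for the right ballpark
```

Under a looser tolerance, the question shifts from "is every digit right?" to "did the model apply the right operation?", which is the lens described above.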


What this does (and does not) claim

This experiment does not prove that:

  • The model “understands” arithmetic in a human sense
  • Grokking has already occurred
  • Loss functions are fundamentally broken

What it does suggest is more subtle:

Arithmetic models don’t fail because they haven’t learned the rule — they fail because our metrics demand perfection before acknowledging understanding.

That distinction matters, especially when studying learning dynamics.


What happens next

Training is still ongoing beyond 500 epochs.

If classical grokking appears, I expect to see:

  • A sharp collapse in training loss
  • A discrete jump in exact-match accuracy
  • The disappearance of termination and precision errors

Or it may not happen at all.

Either outcome is informative.

For now, the most interesting result isn’t whether grokking eventually happens —
it’s how much learning can occur before our metrics notice it.

