An experiment in memorization, grokking, and misleading loss curves
This post documents an experiment that didn’t go the way I expected.
What started as a simple attempt to observe memorization and grokking in arithmetic models turned into a deeper lesson about how misleading loss curves can be — especially for algorithmic tasks.
What I expected to see
I went into this experiment with a fairly standard mental model of how learning unfolds in neural networks, particularly for structured, rule-based problems.
The story usually goes like this:
- Training loss collapses first, as the model memorizes patterns in the training data.
- Evaluation loss remains high, because the model fails on unseen examples.
- After sufficient training, evaluation loss suddenly drops, signaling grokking — the moment the model discovers the underlying rule and generalizes.
Arithmetic felt like an ideal testbed for observing this behavior. The rules are precise. The outputs are unambiguous. There’s nowhere to hide behind semantics or subjective interpretation.
I expected the training loss curve to tell a clean, familiar story.
That’s not what happened.
The experiment (briefly)
I fine-tuned a small transformer (DistilGPT-2) on synthetic arithmetic tasks:
- Operations: addition, subtraction, multiplication, division
- Inputs: structured text (e.g., DIVIDE 845 79)
- Outputs: exact numerical answers
- Training setup:
- Constant learning rate
- Weight decay
- Long-horizon training (500+ epochs)
- Standard cross-entropy loss
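For concreteness, here is a minimal sketch of the kind of data generation involved. The prompt layout, operand ranges, and rounding below are illustrative assumptions reconstructed from the examples later in this post, not the exact generation script.

```python
import random

# Illustrative reconstruction of the synthetic data format; the operand
# ranges, rounding, and "OP a b = answer" layout are assumptions.
OPS = {
    "SUM": lambda a, b: a + b,
    "SUBTRACT": lambda a, b: a - b,
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: round(a / b, 8),
}

def make_example(rng: random.Random) -> str:
    op = rng.choice(list(OPS))
    a, b = rng.randint(1, 999), rng.randint(1, 999)
    # One training sequence: the structured prompt followed by the exact answer.
    return f"{op} {a} {b} = {OPS[op](a, b)}"

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(10_000)]
```

Training itself was standard causal language modeling on strings like these: cross-entropy over every token, a constant learning rate, weight decay, and many hundreds of epochs.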
The goal wasn’t performance or benchmarking. It was to observe learning dynamics in a controlled setting.
The first surprise: loss curves didn’t behave as expected

What I saw early on was confusing:
- Training loss decreased slowly and stubbornly.
- Evaluation loss dropped dramatically — early and consistently.
- Gradients were stable, learning rate constant, and training entirely well-behaved.
At first glance, this didn’t make sense.
If training loss represents memorization, and evaluation loss represents generalization, how could the model be generalizing before it memorized?
The loss curves left me confused.
But they weren't unstable or noisy; they were calmly telling a story I didn't yet understand.
When loss curves stopped explaining what was happening
Trying to reason purely from the curves got me nowhere.
So I stopped staring at the loss and started looking directly at the model’s outputs.
That’s where things finally clicked.
Looking at predictions changed everything
| Operation | Input | Expected Output | Model Output | Observation |
|---|---|---|---|---|
| DIVIDE | 845 ÷ 79 | 10.69620253 | 10.69620253 | Correct to several decimals |
| DIVIDE | 356 ÷ 936 | 0.38034188 | 0.3165618 | Numerically close, wrong digits |
| SUM | 568 + 390 | 958 | 958 | Exact match |
| MULTIPLY | 974 × 915 | 891210 | 8912101025 | Correct structure, failed termination |
| SUBTRACT | 190 − 498 | -308 | -308781889 | Correct sign, magnitude explosion |
Once I inspected individual predictions, clear patterns emerged.
The model was often:
- Applying the correct operation
- Producing values with the right order of magnitude
- Generating numerically close answers, especially for division
But the answers were still wrong.
And in arithmetic, close is indistinguishable from incorrect.
One extra digit.
One misplaced decimal.
One failure to stop generation at the right time.
From the perspective of cross-entropy loss, these are complete failures.
Behaviorally, they tell a very different story.
The model almost always applies the correct operation, even when the answer is wrong.
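One way to make that kind of inspection repeatable is to bucket (expected, predicted) pairs into rough failure modes. A minimal sketch; the categories and the 20% relative tolerance are illustrative choices, not part of the training setup:

```python
def classify_prediction(expected: str, predicted: str, rel_tol: float = 0.2) -> str:
    """Bucket a prediction into a rough failure mode.

    The buckets and the 20% relative tolerance are illustrative.
    """
    if predicted == expected:
        return "exact"
    if predicted.startswith(expected):
        # Correct answer with extra digits appended: a termination failure.
        return "failed_termination"
    try:
        exp_val, pred_val = float(expected), float(predicted)
    except ValueError:
        return "malformed"
    if exp_val != 0 and abs(pred_val - exp_val) / abs(exp_val) <= rel_tol:
        return "numerically_close"
    return "wrong"

# Rows from the table above:
print(classify_prediction("891210", "8912101025"))     # failed_termination
print(classify_prediction("0.38034188", "0.3165618"))  # numerically_close (~17% off)
```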
Arithmetic is unforgiving — and loss reflects that
Arithmetic exposes a fundamental weakness in token-level loss metrics:
- Partial correctness is invisible.
- Near-correct answers receive the same penalty as wildly incorrect ones.
- One wrong digit invalidates the entire sequence.
This is especially severe for division. Predicting 0.316 instead of 0.380 is numerically close, but token-wise it’s treated as fully wrong.
This leads to an important realization:
The model learned the rule long before it learned precision.
Loss does not acknowledge that.
A reframing that made things click
At some point, I wrote this down while reviewing outputs:
“I know how to compute, but I haven’t learned the termination constraint.”
That single sentence explained many of the failures I was seeing:
- Correct digit-level computation
- Correct sign and scale
- Failure to stop at the right point
- Extra digits appended to otherwise correct answers
This isn’t random guessing.
It’s incomplete algorithm execution.
Which leads to the core insight of this experiment:
The model didn’t fail to learn arithmetic — our metrics failed to notice when it did.
Loss vs learning
This experiment forced me to rethink what loss curves are actually measuring.
The distinction that matters is this:
Grokking is about representations.
Loss is about tokens.
For arithmetic tasks, these two can drift far apart.
That’s why evaluation loss can improve early, training loss can remain high, and the model can look “dumb” long after it has learned the underlying rule.
Put more bluntly:
Loss curves can lie badly for algorithmic tasks.
The wrong question — and the right one
Early on, I kept implicitly asking:
Is the answer exactly right?
But that turned out to be the wrong question.
The more informative question is:
Is the model executing the correct procedure?
Once I started evaluating predictions with that lens — including decimal-aware comparisons for division — the behavior made sense.
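Concretely, the lens change amounts to swapping the scoring function. A minimal sketch, assuming predictions have already been decoded to strings; the 5% tolerance is an arbitrary illustrative choice:

```python
import math

def exact_match(expected: str, predicted: str) -> bool:
    return predicted.strip() == expected.strip()

def decimal_aware_match(expected: str, predicted: str, rel_tol: float = 0.05) -> bool:
    """Counts a prediction as correct if it parses to a number within a
    relative tolerance of the target. The 5% tolerance is illustrative."""
    try:
        return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
    except ValueError:
        return False

def accuracy(pairs, match_fn):
    # pairs: list of (expected, predicted) string pairs from the eval set
    return sum(match_fn(e, p) for e, p in pairs) / len(pairs)
```

Scoring the same predictions with both functions answers two different questions: "is the answer exactly right?" versus "is the model in the right numerical neighbourhood, plausibly executing the correct procedure?" The point is not that one metric is better, but that the two can disagree wildly on the same set of predictions.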
What this does (and does not) claim
This experiment does not prove that:
- The model “understands” arithmetic in a human sense
- Grokking has already occurred
- Loss functions are fundamentally broken
What it does suggest is more subtle:
Arithmetic models don’t fail because they haven’t learned the rule — they fail because our metrics demand perfection before acknowledging understanding.
That distinction matters, especially when studying learning dynamics.
What happens next
Training is still ongoing beyond 500 epochs.
If classical grokking appears, I expect to see:
- A sharp collapse in training loss
- A discrete jump in exact-match accuracy
- The disappearance of termination and precision errors
Or it may not happen at all.
Either outcome is informative.
For now, the most interesting result isn’t whether grokking eventually happens —
it’s how much learning can occur before our metrics notice it.