An empirical exploration of why arithmetic models sometimes neither memorize nor generalize
(Part 2 of an ongoing exploration into how small language models learn arithmetic)
In the previous post, I described an experiment that started with a simple goal:
to observe memorization and eventual grokking in a small language model trained on arithmetic operations.
The setup was straightforward.
The expectation was almost textbook.
Train long enough → memorization emerges → test loss collapses → generalization appears.
But what actually happened was far more interesting.
This post documents what I observed after extending training to 1,000 epochs, why the expected “grokking moment” never arrived, and what this reveals about the tension between memorization, optimization, and learning in language models.
Recap (Brief)
In Part 1, I trained a small transformer (DistilGPT-2) on synthetic arithmetic data involving addition, subtraction, multiplication, and division.
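As a reminder of the setup, here is a minimal sketch of how such a dataset might be generated. The `OP a b = answer` layout, the operand ranges, and the float formatting are assumptions modeled on the examples quoted later in this post, not the exact script from Part 1.

```python
import random

# Hypothetical generator for the synthetic arithmetic data.
# The "OP a b = answer" text layout and operand ranges are assumptions,
# loosely based on the examples shown later in this post.
OPS = {
    "ADD": lambda a, b: a + b,
    "SUBTRACT": lambda a, b: a - b,
    "MULTIPLY": lambda a, b: a * b,
    "DIVIDE": lambda a, b: a / b,
}

def make_example(rng: random.Random) -> str:
    op = rng.choice(sorted(OPS))
    a, b = rng.randint(1, 999), rng.randint(1, 999)
    result = OPS[op](a, b)
    # Division answers are written as plain decimal text, trimmed of trailing zeros.
    answer = f"{result:.14f}".rstrip("0").rstrip(".") if op == "DIVIDE" else str(result)
    return f"{op} {a} {b} = {answer}"

rng = random.Random(0)
dataset = [make_example(rng) for _ in range(10_000)]  # ~10k examples, as in Part 1
print(dataset[:3])
```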
The goal was to observe:
- Whether the model would memorize the dataset
- Whether memorization would precede generalization
- Whether a clear grokking phase would emerge
By ~300-500 epochs, something curious was already happening:
- Training loss was decreasing slowly
- Evaluation loss was also decreasing
- Yet the model was still producing many incorrect numeric outputs
This contradicted the simple story that:
“Models memorize first, then generalize.”
So I kept training.
What happened after 1,000 epochs
At ~1000 epochs, the picture became clearer — and stranger.
Observed metrics
- Training loss plateaued around 0.86
- Evaluation loss drifted upward toward 1.6
- Gradient norms remained stable
- No sudden phase transition
- No collapse into perfect memorization
Despite massive training time, the model never entered a regime where it simply “remembered” the dataset.
Yet it wasn’t random either.
What the model was actually doing
When inspecting predictions, a consistent pattern emerged:
- The operation was almost always correct (addition, subtraction, division identified correctly)
- The magnitude was often close
- The digits were frequently wrong
Example:
```
Input:  SUBTRACT 521 601
Output: -808088512
Target: -80
```
or
```
Input:  DIVIDE 845 79
Output: 10.69620253
Target: 10.69620253164557
```
This wasn’t noise.
This was structure without precision.
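To make "structure without precision" a bit more quantitative, here is a toy scoring helper applied to the two predictions above. The categories and thresholds are mine, not part of the original evaluation.

```python
import math

def score(pred: str, target: str) -> dict:
    """Rough per-prediction scoring: sign, order of magnitude, exact string."""
    p, t = float(pred), float(target)
    return {
        "sign_ok": (p >= 0) == (t >= 0),
        # within one order of magnitude of the true answer
        "magnitude_ok": abs(math.log10(abs(p) + 1e-9) - math.log10(abs(t) + 1e-9)) < 1.0,
        "exact": pred.strip() == target.strip(),
    }

print(score("-808088512", "-80"))                  # sign right; magnitude and digits wrong
print(score("10.69620253", "10.69620253164557"))   # sign and magnitude right; digits truncated
```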
Why memorization never happened
At first glance, this seems odd.
The dataset was small (~10k examples).
The model was large enough to memorize it.
So why didn’t it?
The answer lies in how the optimization problem was shaped.
Memorization was expensive
To memorize arithmetic examples exactly, the model would need to store long token sequences with high precision.
That means:
- remembering long digit strings
- learning exact termination behavior
- assigning high confidence to precise token sequences
This is expensive in parameter space.
In contrast, learning approximate arithmetic rules is cheaper.
SGD naturally prefers solutions that:
- reduce loss quickly
- generalize across many samples
- reuse shared structure
So the model took the path of least resistance.
It learned how arithmetic behaves without committing to exact outputs.
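To see why exact recall is costly, consider tokenization. DistilGPT-2 uses GPT-2's BPE vocabulary, which splits long digit strings into several short pieces, and every one of those pieces has to be predicted correctly for a memorized answer. A quick check with the Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

# DistilGPT-2 shares GPT-2's byte-pair-encoding vocabulary, which breaks
# long digit strings into several short pieces rather than single tokens.
tok = AutoTokenizer.from_pretrained("distilgpt2")

for answer in ["-80", "-808088512", "10.69620253164557"]:
    pieces = tok.tokenize(answer)
    print(f"{answer!r:>22} -> {len(pieces)} tokens: {pieces}")

# Exact memorization means every one of these positions must be predicted
# correctly; an approximate rule that fixes the operation, sign, and rough
# magnitude earns most of the per-token credit far more cheaply.
```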
The loss function rewarded “almost right”
Language modeling loss is token-based.
That means:
- If the model predicts most digits correctly but misses a few, it still receives a useful gradient signal.
So during training, the model is constantly reinforced for being approximately right.
This creates a subtle but powerful effect:
The model is rewarded for learning the procedure, not the exact answer.
In other words, the optimization objective quietly prefers understanding over memorization.
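Here is a self-contained toy illustration of that effect (not the actual training code): with a one-token-per-digit vocabulary and a fairly confident model, an answer with one wrong digit earns a loss far closer to "all correct" than to "all wrong".

```python
import torch
import torch.nn.functional as F

target = torch.tensor([1, 2, 3, 4, 5])  # digit tokens of the true answer

def avg_loss(predicted_digits, confidence=0.9):
    """Mean cross-entropy when the model puts `confidence` on its predicted
    digit at each position and spreads the rest over a 10-digit vocabulary."""
    probs = torch.full((len(target), 10), (1 - confidence) / 9)
    probs[torch.arange(len(target)), predicted_digits] = confidence
    return F.nll_loss(probs.log(), target).item()

print("all digits right:", avg_loss(torch.tensor([1, 2, 3, 4, 5])))   # ~0.11
print("one digit wrong: ", avg_loss(torch.tensor([1, 2, 3, 9, 5])))   # ~0.98
print("all digits wrong:", avg_loss(torch.tensor([9, 8, 7, 6, 0])))   # ~4.50
# The mostly-correct answer sits far closer to the exact one than to the
# all-wrong guess, so the gradient keeps reinforcing the procedure rather
# than exact recall.
```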
Why grokking never appeared
Classic grokking shows a sudden collapse in test loss after long stagnation.
That phenomenon usually requires:
- Strong regularization
- A sharp transition between rule discovery and rule application
- A training signal that punishes partial correctness
In this experiment:
- The loss never became harsh enough
- Partial correctness was always rewarded
- There was no incentive to “snap” into exact symbolic computation
So the model stayed in a stable middle ground:
It learned how arithmetic works, but never felt pressure to be perfect at it.
Between Two Worlds
This experiment ended up living in an interesting middle space.
Not memorization.
Not grokking.
Something in between.
The model learned enough structure to behave intelligently,
but never crossed the threshold into exact symbolic computation.
It was neither guessing nor recalling.
It was reasoning — approximately.
That, in itself, is revealing.
What this tells us
- Memorization is not the default outcome of training — even with enough capacity.
- Grokking is not inevitable — it requires specific pressure.
- Optimization dynamics matter more than model size.
- Language models can behave like approximate algorithm learners without ever becoming exact ones.
Or put more simply:
The model knew what to do long before it knew how to do it perfectly.
What could be a cool experiment next
I haven’t decided what it will be yet!
Possibilities include:
- Changing the loss to punish approximation (a rough sketch follows this list)
- Reducing vocabulary to force symbolic precision
- Isolating arithmetic types (addition only)
- Or pushing the same setup until it breaks in an interesting way
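For the first item on that list, here is a rough sketch of what punishing approximation could look like: ordinary token-level cross-entropy, upweighted for examples whose greedy prediction is not an exact match. The function name and penalty value are invented, it assumes Hugging Face-style `(batch, seq, vocab)` logits with prompt positions set to `-100` in the labels, and it scores teacher-forced argmax rather than free-running generation.

```python
import torch
import torch.nn.functional as F

def exactness_weighted_loss(logits, labels, penalty=4.0, ignore_index=-100):
    """Token-level cross-entropy, upweighted for sequences whose greedy
    (teacher-forced) prediction does not exactly match the target tokens.
    (Causal-LM label shifting is omitted here for brevity.)"""
    # Per-token loss, shape (batch, seq); ignored positions contribute 0.
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=ignore_index, reduction="none"
    )
    mask = labels != ignore_index
    per_example = per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Exact match over the answer tokens only.
    exact = ((logits.argmax(dim=-1) == labels) | ~mask).all(dim=1)
    weights = torch.ones_like(per_example)
    weights[~exact] = penalty
    return (weights * per_example).mean()
```

Whether that extra pressure would be enough to trigger a grokking-style snap is exactly the open question.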
For now, I’m leaving the experiment here — not as a conclusion, but as a checkpoint.
Appendix (for readers who want details)
- Model: DistilGPT-2
- Training: ~1,000 epochs
- Task: Mixed arithmetic (add / subtract / multiply / divide)
- Key observation: sustained partial correctness without memorization
- Artifacts:

If you’ve read this far, you’re exactly the audience this experiment was for.
This wasn’t about getting a model to work.
It was about watching how learning actually unfolds when nothing forces it to be perfect.
And sometimes, that’s where the most interesting behavior lives.