Between Memorization and Meaning: When Neural Networks Learn, But Not the Way We Expect

An empirical exploration of why arithmetic models sometimes neither memorize nor generalize

(Part 2 of an ongoing exploration into how small language models learn arithmetic)


In the previous post, I described an experiment that started with a simple goal:
to observe memorization and eventual grokking in a small language model trained on arithmetic operations.

The setup was straightforward.
The expectation was almost textbook.

Train long enough → memorization emerges → test loss collapses → generalization appears.

But what actually happened was far more interesting.

This post documents what I observed after extending training to 1,000 epochs, why the expected “grokking moment” never arrived, and what this reveals about the tension between memorization, optimization, and learning in language models.


Recap (Brief)

In Part 1, I trained a small transformer (DistilGPT-2) on synthetic arithmetic data involving addition, subtraction, multiplication, and division.
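
For reference, the training data looked roughly like the sketch below. This is a reconstruction from the examples quoted later in this post, not the actual generation script, and the exact prompt/answer formatting may differ.

    import random

    # Rough reconstruction of the synthetic data format (not the original script).
    OPS = {
        "ADD": lambda a, b: a + b,
        "SUBTRACT": lambda a, b: a - b,
        "MULTIPLY": lambda a, b: a * b,
        "DIVIDE": lambda a, b: a / b,
    }

    def make_example(max_operand: int = 999) -> str:
        op = random.choice(list(OPS))
        a, b = random.randint(1, max_operand), random.randint(1, max_operand)
        return f"{op} {a} {b} = {OPS[op](a, b)}"

    dataset = [make_example() for _ in range(10_000)]  # ~10k examples, as in Part 1
    print(dataset[0])  # e.g. "DIVIDE 845 79 = 10.69620253164557"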

The goal was to observe:

  • Whether the model would memorize the dataset
  • Whether memorization would precede generalization
  • Whether a clear grokking phase would emerge

By ~300-500 epochs, something curious was already happening:

  • Training loss was decreasing slowly
  • Evaluation loss was also decreasing
  • Yet the model was still producing many incorrect numeric outputs

This contradicted the simple story that:

“Models memorize first, then generalize.”

So I kept training.


What happened after 1000 epochs

At ~1000 epochs, the picture became clearer — and stranger.

Observed metrics

  • Training loss plateaued around 0.86
  • Evaluation loss drifted upward toward 1.6
  • Gradient norms remained stable
  • No sudden phase transition
  • No collapse into perfect memorization

Despite massive training time, the model never entered a regime where it simply “remembered” the dataset.

Yet it wasn’t random either.


What the model was actually doing

When inspecting predictions, a consistent pattern emerged:

  • The operation was almost always correct
    (addition, subtraction, division identified correctly)
  • The magnitude was often close
  • The digits were frequently wrong

Example:

Input:  SUBTRACT 521 601
Output: -808088512
Target: -80

or

Input: DIVIDE 845 79
Output: 10.69620253
Target: 10.69620253164557

This wasn’t noise.

This was structure without precision.
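
One simple way to put numbers on “structure without precision” is to score each prediction twice: exact string match, and numeric closeness within a relative tolerance. The score helper below is a minimal sketch, not the evaluation code from the experiment.

    def score(pred: str, target: str, rel_tol: float = 0.05):
        """Exact string match vs. numerically 'close enough' (within rel_tol)."""
        exact = pred.strip() == target.strip()
        try:
            p, t = float(pred), float(target)
            close = abs(p - t) <= rel_tol * max(abs(t), 1e-9)
        except ValueError:
            close = False
        return exact, close

    # Scoring the two examples above:
    print(score("10.69620253", "10.69620253164557"))  # (False, True): right value, truncated digits
    print(score("-808088512", "-80"))                 # (False, False): right operation, wrong magnitude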


Why memorization never happened

At first glance, this seems odd.

The dataset was small (~10k examples).
The model was large enough to memorize it.
So why didn’t it?

The answer lies in how the optimization problem was shaped.

Memorization was expensive

To memorize arithmetic examples exactly, the model would need to store long token sequences with high precision.

That means:

  • remembering long digit strings
  • learning exact termination behavior
  • assigning high confidence to precise token sequences

This is expensive in parameter space.
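
To get a feel for that cost, one can check how many tokens a single exact answer occupies under the GPT-2 tokenizer that DistilGPT-2 inherits. This is a quick illustration, not part of the training code.

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("distilgpt2")

    answer = "10.69620253164557"  # the division target from the example above
    ids = tok.encode(answer)
    print(len(ids), tok.convert_ids_to_tokens(ids))
    # The BPE vocabulary splits long numbers into several sub-tokens, so an
    # exact answer is a multi-token sequence the model must get right end to end.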

In contrast, learning approximate arithmetic rules is cheaper.

SGD naturally prefers solutions that:

  • reduce loss quickly
  • generalize across many samples
  • reuse shared structure

So the model took the path of least resistance.

It learned how arithmetic behaves without committing to exact outputs.


The loss function rewarded “almost right”

Language modeling loss is token-based.

That means that if the model predicts most digits of an answer correctly but misses a few, it still earns a low loss on the tokens it got right, and the gradient only pushes on the digits it missed.

So during training, the model is constantly reinforced for being approximately right.

This creates a subtle but powerful effect:

The model is rewarded for learning the procedure, not the exact answer.

In other words, the optimization objective quietly prefers understanding over memorization.
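
To make that concrete, here is a toy calculation with made-up per-token probabilities (not measurements from the experiment). Token-level loss is just the average of -log p(correct token), so a model that is confident about the sign and leading digits but unsure about the trailing ones already sits at a fairly low loss.

    import torch

    # Made-up probabilities the model assigns to the *correct* token at each
    # position of an answer: confident on the leading digits, shaky at the end.
    mostly_right = torch.tensor([0.95, 0.95, 0.90, 0.30, 0.10])
    uniformly_wrong = torch.tensor([0.10, 0.10, 0.10, 0.10, 0.10])

    # Language-modeling loss is the mean of -log p(correct token).
    print((-mostly_right.log()).mean().item())     # ~0.74
    print((-uniformly_wrong.log()).mean().item())  # ~2.30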


Why grokking never appeared

Classic grokking shows a sudden collapse in test loss after long stagnation.

That phenomenon usually requires:

  • Strong regularization
  • A sharp transition between rule discovery and rule application
  • A training signal that punishes partial correctness

In this experiment:

  • The loss never became harsh enough
  • Partial correctness was always rewarded
  • There was no incentive to “snap” into exact symbolic computation

So the model stayed in a stable middle ground:

It learned how arithmetic works, but never felt pressure to be perfect at it.
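
For contrast, the setups where grokking is usually reported apply much harsher pressure. The sketch below shows those ingredients in rough form; it is not the configuration used in this experiment, and the Linear layer is only a stand-in so the snippet runs.

    import torch

    model = torch.nn.Linear(128, 128)  # stand-in for the transformer

    # Ingredient 1: strong weight decay, a common trigger in grokking studies.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

    # Ingredient 2: an all-or-nothing metric, so "almost right" counts as wrong.
    def exact_match(preds, targets):
        return sum(p.strip() == t.strip() for p, t in zip(preds, targets)) / len(targets)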


Between Two Worlds

This experiment ended up living in an interesting middle space.

Not memorization.
Not grokking.
Something in between.

The model learned enough structure to behave intelligently,
but never crossed the threshold into exact symbolic computation.

It was neither guessing nor recalling.
It was reasoning — approximately.

That, in itself, is revealing.


What this tells us

  • Memorization is not the default outcome of training — even with enough capacity.
  • Grokking is not inevitable — it requires specific pressure.
  • Optimization dynamics matter more than model size.
  • Language models can behave like approximate algorithm learners without ever becoming exact ones.

Or put more simply:

The model knew what to do long before it knew how to do it perfectly.


What could be a cool experiment next

I haven’t decided what it will be yet!

Possibilities include:

  • Changing the loss to punish approximation (a rough sketch follows this list)
  • Reducing vocabulary to force symbolic precision
  • Isolating arithmetic types (addition only)
  • Or pushing the same setup until it breaks in an interesting way
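
For the first idea, one possible shape for a harsher loss is sketched below. It is hypothetical and untested: keep the usual token-level cross-entropy, but up-weight the answer tokens of any sequence that is not yet exactly right, so being “almost right” stops being a comfortable resting point. The answer_mask argument marks which positions belong to the answer.

    import torch
    import torch.nn.functional as F

    def harsh_answer_loss(logits, labels, answer_mask, wrong_seq_weight=5.0):
        """Hypothetical sketch. logits: (B, T, V), labels: (B, T),
        answer_mask: (B, T) floats with 1.0 on answer positions."""
        token_loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
        ).view(labels.shape)
        with torch.no_grad():
            preds = logits.argmax(dim=-1)
            # Sequences where at least one answer token is wrong get up-weighted.
            wrong = ((preds != labels) * answer_mask).sum(dim=-1, keepdim=True) > 0
            weight = 1.0 + (wrong_seq_weight - 1.0) * wrong.float()
        return (token_loss * answer_mask * weight).sum() / answer_mask.sum()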

For now, I’m leaving the experiment here — not as a conclusion, but as a checkpoint.


Appendix (for readers who want details)

  • Model: DistilGPT-2
  • Training: ~1000 epochs
  • Task: Mixed arithmetic (add / subtract / multiply / divide)
  • Key observation: sustained partial correctness without memorization
  • Artifacts: training & evaluation loss curves after 1000 epochs

If you’ve read this far, you’re exactly the audience this experiment was for.

This wasn’t about getting a model to work.
It was about watching how learning actually unfolds when nothing forces it to be perfect.

And sometimes, that’s where the most interesting behavior lives.

Published by Sam Banerjee

I’m an AI and software engineering consultant who helps organizations design and deliver production-ready AI systems. I specialize in translating complex machine learning ideas into scalable, reliable solutions that work under real-world constraints. My focus spans AI architecture, applied ML, and system design, with an emphasis on model behavior, generalization, and operational risk. I work end-to-end—from problem framing and technical strategy to execution and deployment—bringing clarity, rigor, and pragmatism to AI initiatives.
