Where we are: does the language model actually do anything?

Synthena is an attempt to build something that behaves like a living creature on a laptop, with an eye toward a physical body eventually: not by scaling a chatbot, but by building a body that can live, take damage, and die, and asking what role a language model can honestly play inside it.

This is a lab notebook. We publish what we measure, including the parts that don’t work, and we try not to claim more than the controls support.

A course-correction

For a while we tried to make the language model itself be the organism, with its running text serving as the “mind” of the creature. We’ve stopped, and we can now say precisely why. A transformer is a pure function: it is inert between calls, it reads text descriptions rather than the state of a world, and nothing inside it persists or evolves on its own. Asking it to “be alive” was a category error, and our long run of null results was the signature of a wrong substrate, not of bad tuning.

So we changed the shape. The body is now a continuous, mortal system: it has energy, it accumulates damage, it can be perturbed, and it can die. The language model is demoted to a bounded brain, an organ. It sees a compact summary of the body’s state, and it can only propose small adjustments to the body’s internal parameters (a lean toward repair, a lean toward seeking food). It cannot move, eat, or act directly. The body lives or dies on its own physics; the brain only nudges.
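As an illustration only, that division of labor might be sketched like this in Python. Every name and constant here (Body, MAX_NUDGE, the energy and repair rates) is an assumption made for the sketch and is not taken from the project's code. The point it shows is structural: the body's step function runs on its own, and the only thing a brain can reach is a clamped change to two internal parameters.

```python
from dataclasses import dataclass

MAX_NUDGE = 0.05  # assumed cap on how far one proposal may move a parameter


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


@dataclass
class Body:
    energy: float = 1.0
    damage: float = 0.0
    repair_bias: float = 0.5  # lean toward repair, kept in [0, 1]
    forage_bias: float = 0.5  # lean toward seeking food, kept in [0, 1]
    alive: bool = True

    def summary(self) -> dict:
        """The compact state summary: all the brain gets to see."""
        return {"energy": round(self.energy, 2), "damage": round(self.damage, 2)}

    def apply_nudge(self, d_repair: float, d_forage: float) -> None:
        """Accept a proposal, but only as a small, bounded parameter change."""
        self.repair_bias = clamp(
            self.repair_bias + clamp(d_repair, -MAX_NUDGE, MAX_NUDGE), 0.0, 1.0
        )
        self.forage_bias = clamp(
            self.forage_bias + clamp(d_forage, -MAX_NUDGE, MAX_NUDGE), 0.0, 1.0
        )

    def perturb(self, amount: float) -> None:
        """An external hit adds damage."""
        self.damage += amount

    def step(self) -> None:
        """The body's own physics; it runs whether or not a brain is attached."""
        if not self.alive:
            return
        # Made-up rates: foraging gains energy, upkeep and repair cost energy.
        self.energy = min(
            1.0,
            self.energy + 0.03 * self.forage_bias - 0.02 - 0.02 * self.repair_bias,
        )
        self.damage = max(0.0, self.damage - 0.05 * self.repair_bias)
        if self.energy <= 0.0 or self.damage >= 1.0:
            self.alive = False
```

In this sketch a brain that asks for a huge change, for example body.apply_nudge(1.0, -1.0), still moves each bias by at most MAX_NUDGE, and nothing in the brain's interface touches energy or damage directly.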

The question we made falsifiable

Once the model is just an organ, the honest skeptical question is unavoidable: does it actually do anything? Could you replace it with noise and get the same result?

So we built controls, and the result only counts if it beats all of them:

no brain at all;
a brain that emits invalid signals: does the body just need a signal?
a brain with its weights randomized: same architecture, no learning, so does the trained model matter, or just the shape of one?
and the sharp one, a “valid-random” brain that emits perfectly valid adjustments chosen at random: does the model’s choice matter, or just the act of proposing a valid move?

If the real model can’t beat valid-but-random choices, its “thinking” is decorative.
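A rough sketch of three of these controls, assuming a proposal is a small dictionary naming one parameter and one allowed step. The proposal format, the parameter names, and the step sizes are all hypothetical. The randomized-weights control is left out because it needs an actual model.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical proposal vocabulary, invented for this sketch.
VALID_PARAMS = ("repair_bias", "forage_bias")
VALID_DELTAS = (-0.05, 0.05)

Proposal = Optional[dict]
Brain = Callable[[dict, random.Random], Proposal]


def is_valid(proposal: Proposal) -> bool:
    """A proposal counts only if it names a known parameter and an allowed step."""
    return (
        isinstance(proposal, dict)
        and proposal.get("param") in VALID_PARAMS
        and proposal.get("delta") in VALID_DELTAS
    )


def no_brain(summary: dict, rng: random.Random) -> Proposal:
    """Control: the body runs with no proposals at all."""
    return None


def invalid_brain(summary: dict, rng: random.Random) -> Proposal:
    """Control: a signal arrives every step, but it is never a valid move."""
    return {"param": "not_a_parameter", "delta": 9.9}


def valid_random_brain(summary: dict, rng: random.Random) -> Proposal:
    """Control: always a valid move, chosen without looking at the body's state."""
    return {"param": rng.choice(VALID_PARAMS), "delta": rng.choice(VALID_DELTAS)}


CONTROLS: Dict[str, Brain] = {
    "no_brain": no_brain,
    "invalid": invalid_brain,
    "valid_random": valid_random_brain,
}
```

The valid-random brain ignores the summary it is handed. That is what makes it the sharp comparison: it matches the real model in how often it emits a valid move and differs only in which move it picks.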

What we measured

We used a survival-recovery task: the body is hit with a perturbation that kills the no-brain version, and it has a limited window in which to recover. On that task we measured:

A frozen Qwen model’s choices beat all four controls across five independent runs.
A second model from a different family (Dolphin, built on Llama) reproduced the key result on its own: it beat the valid-random control on every one of five runs (p ≈ 0.006), and the effect held when we made the task run longer.

Plainly: the language model’s choices, not merely the act of emitting a valid signal, measurably improve the body’s survival, beyond what random valid choices achieve. Replace the model with noise, and survival gets worse.

Keeping ourselves honest

The part we’re most careful about is not the result but the discipline around it. From this stretch:

We caught a parser bug that was crediting an ambiguous model output. We fixed it and re-ran, and one model’s apparent improvement disappeared once the bug was gone. That arm went in the bin.
Before running any longer test, we check that a hand-coded optimal policy can actually survive it. When even the optimal policy died of energy depletion on a longer task, we refused to run the model there: a failure on an unsurvivable task tells you nothing.
Every result carries an explicit “we are not claiming” line, and we record only what survives the controls.
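Two of these habits lend themselves to a short sketch. The action tokens, the function names, and the run_episode signature below are assumptions for illustration only. The parser credits an output only when it names exactly one known action, and the gate refuses a task unless a reference policy survives it on every seed.

```python
import re
from typing import Callable, Iterable, Optional

# Hypothetical action vocabulary; the real output format is not specified here.
ACTIONS = ("REPAIR_UP", "REPAIR_DOWN", "FORAGE_UP", "FORAGE_DOWN")
_TOKEN = re.compile(r"\b(" + "|".join(ACTIONS) + r")\b")


def parse_action(text: str) -> Optional[str]:
    """Return the action only if the output names exactly one distinct action.

    Outputs naming zero actions or several different actions are ambiguous
    and yield None, so they are never credited as a valid choice.
    """
    found = set(_TOKEN.findall(text))
    return found.pop() if len(found) == 1 else None


def task_is_survivable(
    run_episode: Callable[[object, int], bool],
    reference_policy: object,
    seeds: Iterable[int],
) -> bool:
    """Gate: True only if the reference policy survives the task on every seed.

    run_episode(policy, seed) is assumed to return True when the body is
    still alive at the end of the episode.
    """
    return all(run_episode(reference_policy, seed) for seed in seeds)
```

Under this sketch, an output such as "REPAIR_UP or FORAGE_UP" parses to None rather than to whichever token appears first. Treating a repeated mention of the same token as unambiguous is a design choice, and a stricter parser could reject that too.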

What this is not

This is not life, not consciousness, not self-maintenance in the technical sense, and we are not claiming any of them. It is a controlled measurement that a frozen language model’s learned prior is load-bearing for a simple body’s survival, on one task, with two models. We have not shown it generalizes across structurally different tasks, and the body does not yet reproduce or evolve under selection.

What’s next

Two things turn a promising shape into a finding: a second, structurally different task, one where the body must manage energy and forage rather than just repair damage, and a stronger version of the randomized-weights control. Both are in progress.

And separately, the thing most people mean by “artificial life”, a body that reproduces, mutates, and is selected across its own deaths, is a different and harder line of work. It is where we go after the current question is closed honestly.

We’ll post the next update when there’s something measured to report.