Findings

Every claim, with the status it earned.

A finding is not a headline. Each one states the question, the hypothesis, the method, its controls, the result, and the alternative explanations that could still be true. A status is assigned, and it can be downgraded. Refuted and retracted findings stay on this page.

11 findings: 5 supported, 2 signal, 1 inconclusive, 3 refuted, 0 retracted
F-11 Signal

A falsifier-sensitive, geometry-retained novelty signal

A signal, not a claim: sparse and single-run, held only on the stricter readout, with the open-ended-evolution gate still failing closed.

Question
Is there novelty in the substrate that depends on its real geometry and resource structure, rather than on noise or generic churn, and does it hold under a longer horizon?
Hypothesis
If the novelty is structure-dependent, a strict geometry-retained readout should fire in the true substrate and stay silent when the structure is shuffled or destroyed, and it should not wash out as the horizon lengthens.
Method
A matched panel over 12 seeds compared the real substrate against a shuffled-structure control and a destroyed-structure control on a strict geometry-retained novelty readout, run at a shorter horizon and again at a longer bounded horizon, with all artifacts verified.
Controls
Two falsifiers at matched volume: shuffled structure and destroyed structure. A broader sustained-novelty readout was reported alongside as a frontier check.
Result
The strict readout rose from 1 of 12 seeds at the shorter horizon to 3 of 12 at the longer horizon, while both the shuffled and destroyed controls stayed at 0 of 12. The signal strengthened with horizon rather than washing out, and broke under shuffled or destroyed structure. Artifacts verified cleanly, with no hash mismatch, no gate weakening, no threshold retuning, and nothing promoted.
Alternatives
The signal is sparse, 3 of 12, and from a single overnight run. A broader sustained-novelty readout still fired across all three arms, so it is too permissive here and is not claim-bearing. The strict open-ended-evolution gate still fails closed. This is evidence of a structure-dependent adaptive trace, not life, autopoiesis, or open-ended evolution.
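The decision rule described in the Hypothesis and Result above can be written down in a few lines. This is a minimal sketch, not the project's gate: the names `ArmCount`, `falsifier_sensitive`, and `strengthened_with_horizon` are illustrative assumptions, and only the seed counts for the strict readout come from the Result.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ArmCount:
    """Seeds on which a readout fired, out of the seeds run (hypothetical container)."""
    fired: int
    seeds: int


def falsifier_sensitive(real: ArmCount, controls: Iterable[ArmCount]) -> bool:
    """True only if the real arm fires at least once and every control stays silent."""
    return real.fired > 0 and all(c.fired == 0 for c in controls)


def strengthened_with_horizon(shorter: ArmCount, longer: ArmCount) -> bool:
    """True if the firing rate rose from the shorter to the longer horizon."""
    return longer.fired / longer.seeds > shorter.fired / shorter.seeds


# Strict readout counts as reported in the Result.
real_short = ArmCount(fired=1, seeds=12)
real_long = ArmCount(fired=3, seeds=12)
shuffled = ArmCount(fired=0, seeds=12)
destroyed = ArmCount(fired=0, seeds=12)

print(falsifier_sensitive(real_long, [shuffled, destroyed]))  # True
print(strengthened_with_horizon(real_short, real_long))       # True

# A readout that also fires in a control fails the same rule. The count of 1
# is a placeholder: the source gives no per-arm counts for the broader readout.
print(falsifier_sensitive(real_long, [ArmCount(fired=1, seeds=12)]))  # False
```

The rule is deliberately all-or-nothing on the controls: a single firing in either falsifier arm is enough to withhold the label, which is why the broader readout that fired in all three arms is not claim-bearing.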
F-10 Refuted

Burst timing is not yet a defensible adaptive signal

Question
Is the headline live-contrast advantage from stall-coupled structural bursts an adaptive signal, or could matched-volume random churn reproduce it?
Hypothesis
If burst timing is adaptive, bursts fired on live stall state should beat a control that fires the same volume of structural churn at random points.
Method
A matched-random burst control: the same structural burst volume as the stall-coupled condition (1,182 burst children), but fired at random points rather than on live stall state.
Controls
The matched-random control is the falsifier. Both arms were verified, run to the same scale, and checked against the open-ended-evolution gate.
Result
A near-tie: the stall-coupled condition produced 41 live-contrast advances, the matched-random control 42. Both failed the open-ended-evolution gate, both plateaued on complexity, and the control verified cleanly with no claim promoted.
Alternatives
Random churn, matched for volume, reproduced or slightly exceeded the result, so burst timing cannot yet be read as adaptive. The falsifier did its job and stopped a tempting over-read. This strengthens the evidence machinery; it advances no life or open-ended-evolution claim.
F-08 Supported

A memory-bearing drive expresses richer, heritable behaviour

Question
Can a stateful drive with a bounded memory let the organism express a wider range of distinct behaviours, and is that range inherited rather than imposed by the measurement?
Hypothesis
If the memory adds real behavioural structure, expressible behavioural dimensionality should rise above the previous linear drive, and the lift should track the inherited genome rather than the metric.
Method
Compare a stateful, memory-bearing drive against the previous linear drive on expressible behavioural dimensionality, with a sweep over the number of internal modes and a parent-child inheritance check.
Controls
A check that the lift is genome-driven, not injected by the metric. A mode-count sweep to confirm the number is not a free knob. Parent-child field similarity across inheritance.
Result
Expressible behavioural dimensionality rose from roughly 2 to 5. The lift is genome-driven, not metric-injected, and heritable, with parent and child fields staying highly similar. The sweep showed 5 is not a free knob: adding more modes does not simply inflate it, which also reveals a ceiling in the current setup.
Alternatives
Expressibility is not selectability (see F-09): a wider range of producible behaviours does not mean selection can maintain them. This is one measure with a known ceiling, and it is not life, open-ended evolution, or sustained novelty.
F-09 Refuted

Priced selection did not maintain the richer behaviours

Question
Under priced fitness, where behaviours must pay their way to survive, are the richer dimension-5 behaviours selectable: does selection maintain them?
Hypothesis
If richer expressible behaviour were enough, priced selection should keep the dimension-5 behaviours alive.
Method
Run the selectability checkpoint under priced fitness on the dimension-5 behaviours from F-08, measuring viable dimensionality after selection.
Controls
Priced fitness as the selection pressure; viable dimensionality measured against the expressible dimensionality from F-08.
Result
An honest null. The richer behaviours were not selectable; viable dimensionality collapsed back to 2. The missing piece is not more drive complexity: the ecology presents only about two meaningful resource channels, so selection has only about two things to reward.
Alternatives
This refutes the idea that drive complexity alone is enough; it does not refute the approach. The diagnosis is the ecology, not the drive, and the next experiment widens it to at least five priced resource channels before rerunning this gate.
F-07 Signal

Online weight updates move short-horizon body-world energy (early, cross-model)

A final-energy result, not a survival claim: the alive counts are close (10/12 online vs 9/12 each control), and "alive" and "death" here are strict bookkeeping terms.

Question
Does letting the brain's weights shift online, through small adapter updates during a life, change the body's outcomes relative to a frozen brain and to no brain?
Hypothesis
If online weight updates are load-bearing, an online-adapter arm should beat matched frozen-brain and no-brain arms on body-world energy outcomes, and the changed weights should actually be used in later forward passes.
Method
Matched-control lifecycle runs (online-adapter, frozen-brain, no-brain) over 36-tick lives. Dolphin at n=12; an independent family, Qwen, at n=3, preregistered before running.
Controls
A frozen-brain arm (same brain, no updates) and a no-brain arm, matched. The updates are verified to be used: 432 online adapter updates, 420 of them feeding changed weights into later forward passes.
Result
Dolphin n=12: the online arm's final energy beat the frozen arm in 11/12 seeds and the no-brain arm in 12/12 (mean about +10 vs frozen, +11 vs no-brain; one-sided sign tests p ≈ 0.003 and p ≈ 0.0002). Qwen n=3 reproduced the direction (online ahead in 3/3, mean about +12) but at p = 0.125, insufficient on its own.
Alternatives
This is a final-energy result, not a survival claim, and the cross-family signal is only directional (Qwen n=3, p = 0.125), so it is early evidence, not a settled finding. It weakens but does not rule out the single-model-quirk explanation, and it demonstrates nothing about viability, homeostasis, biological adaptation, or artificial life. We do not promote "survives better" as a claim, and we hold the statistical and viability claims open.
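The one-sided sign-test p-values quoted in the Result can be reproduced from the win counts alone. The sketch below assumes no tied seeds and a null hypothesis in which the online arm is ahead on any seed with probability 1/2; the function name `sign_test_one_sided` is illustrative.

```python
from math import comb


def sign_test_one_sided(wins: int, n: int) -> float:
    """P(X >= wins) for X ~ Binomial(n, 1/2), assuming no tied pairs."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


print(round(sign_test_one_sided(11, 12), 4))  # 0.0032  (Dolphin, online vs frozen)
print(round(sign_test_one_sided(12, 12), 4))  # 0.0002  (Dolphin, online vs no-brain)
print(sign_test_one_sided(3, 3))              # 0.125   (Qwen, online ahead in 3/3)
```

With n=3, 0.125 is the smallest p-value this test can return, so the Qwen arm could not have reached conventional significance at that sample size whatever the data showed. That is why it is read as directional only.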
F-06 Supported

A frozen model's learned prior is load-bearing for a body's survival

Question
With the model demoted to a bounded brain that can only nudge a mortal body's parameters, does it actually do anything, or could noise replace it?
Hypothesis
If the trained model's prior is load-bearing, its choices should improve the body's survival beyond every control, including a brain that proposes perfectly valid adjustments chosen at random.
Method
A survival-recovery task: the body takes a perturbation that kills the no-brain version, with a limited window to recover. Five independent runs per arm, reproduced on a second model family and on a longer task.
Controls
Four, and the result counts only if it beats all of them: no brain at all; a brain emitting invalid signals; a brain with randomized weights (same architecture, no learning); and a valid-random brain emitting valid adjustments chosen at random.
Result
A frozen Qwen model's choices beat all four controls across five runs. A second family (Dolphin, on Llama) reproduced the result independently, beating the valid-random control on 5 of 5 runs (p ≈ 0.006), and the effect held on a longer task. The model's choices, not merely the act of emitting a valid signal, measurably improve survival.
Alternatives
Generalization is unshown: this is one task with two models that are not structurally different, and the randomized-weights control will be strengthened. This is not life, consciousness, or self-maintenance, and the body does not yet reproduce or evolve under selection. A parser bug that once credited an ambiguous output was caught and fixed, and an arm whose improvement vanished after the fix was dropped.
F-01 Supported

A structural attractor forms in the held cache, and controlled discontinuity breaks it

Question
When a model runs continuously with its KV-cache held as memory, does it settle into a self-reinforcing behavioural lock that is a real structural property of the cache, rather than an artifact of the prompt text?
Hypothesis
If the lock is structural, breaking continuity (resetting the held cache) should break it, a matched run that keeps continuity should stay locked, and a text-only cue with no real cache change should not break it.
Method
Three matched 200-tick arms, n=3 per arm. Measure final-100-tick action entropy and how many runs lock onto a single action. A separate 1000-tick three-arm differential corroborates.
Controls
A no-reset control that keeps continuity, and a SENSE-only confound control that injects the discontinuity as text without changing the cache. Temperature 0, fixed seed.
Result
Discontinuity arm: 0/3 runs locked, mean final-window entropy 1.23, about 123 moves. No-reset control: 3/3 locked, entropy 0.00, about 35 moves. SENSE-only control: 3/3 locked. The text cue alone does not break the lock; the effect requires the real held cache.
Alternatives
Sampling noise (addressed by n=3 per arm and entropy 0.00 in controls). The prompt text driving the change (ruled out by the SENSE-only control). Small-sample fragility remains: n=3 is small, so this is reported as supported, not settled, and is being re-run at larger n.
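The two measurements named in the Method, final-100-tick action entropy and whether a run has locked onto a single action, can be sketched as follows. This is an assumption-laden illustration: the source does not state the logarithm base (bits are assumed here), and the action names and function names are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Sequence


def action_entropy(actions: Sequence[str], window: int = 100) -> float:
    """Shannon entropy, in bits, of the actions in the final `window` ticks."""
    tail = actions[-window:]
    total = len(tail)
    counts = Counter(tail)
    return sum(-(c / total) * log2(c / total) for c in counts.values())


def is_locked(actions: Sequence[str], window: int = 100) -> bool:
    """True if the final window contains exactly one distinct action."""
    return len(set(actions[-window:])) == 1


# Hypothetical 200-tick action traces.
locked_run = ["wait"] * 200
varied_run = ["wait", "move", "probe", "rest"] * 50

print(action_entropy(locked_run), is_locked(locked_run))  # 0.0 True
print(action_entropy(varied_run), is_locked(varied_run))  # 2.0 False
```

An entropy of exactly 0.00 in the final window is equivalent to a lock under this definition, which matches the pairing of "3/3 locked" with "entropy 0.00" in the no-reset control.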
F-02 Supported

A saved cache restores token-identically across a fresh process

Question
Is the held cache a faithful carrier of a life: does saving and restoring it reproduce the same organism, or does restoration quietly alter behaviour?
Hypothesis
If restore is faithful, a fresh process restoring a saved cache should decode the exact same action tokens as the original process.
Method
A create-process versus restore-process comparison. Temperature 0, fixed seed, 435-token prompt, about 70 MiB cache. Compare the first 8 decoded action token ids.
Controls
Two genuinely separate OS processes, not a same-process reload. Corroborated against a 1000-tick differential in which all 20/20 windows matched.
Result
8/8 decoded token ids identical; ok: true. Restoration is deterministic under fixed seed and temperature: a saved run restores token-for-token. (Measured under the earlier architecture, when the model was run as the organism; see entry 010.)
Alternatives
Same-process caching masking a difference (ruled out by using separate processes). Coincidence on a short window (the corroborating 1000-tick run matched 20/20).
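The comparison in the Method reduces to checking a fixed-length prefix of two token-id sequences, one from each process. A minimal sketch, assuming the ids have already been collected as lists of integers: the function name, the report fields other than `ok`, and the example ids are hypothetical.

```python
from typing import Dict, Optional, Sequence, Union


def restore_report(
    created: Sequence[int], restored: Sequence[int], n: int = 8
) -> Dict[str, Union[bool, int, Optional[int]]]:
    """Compare the first n decoded token ids from the two processes."""
    if len(created) < n or len(restored) < n:
        raise ValueError("both sequences need at least n token ids")
    matches = [created[i] == restored[i] for i in range(n)]
    # Index of the first mismatch, or None when the prefixes are identical.
    first_divergence = matches.index(False) if False in matches else None
    return {
        "ok": all(matches),
        "matched": sum(matches),
        "compared": n,
        "first_divergence": first_divergence,
    }


# Hypothetical token ids; real ids depend on the model's tokenizer.
original = [312, 4075, 88, 19, 2201, 7, 530, 64]
print(restore_report(original, list(original)))
# {'ok': True, 'matched': 8, 'compared': 8, 'first_divergence': None}
```

Reporting the index of the first divergence, rather than only a pass or fail flag, makes a failed restore easier to localise, since a mismatch at position 0 and a mismatch at position 7 point to different causes.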
F-03 Supported

A belief the body learned was load-bearing in the decision that ended a life

Single-life trace, n=1: demonstrates the mechanism exists, not how often it occurs.

Question
Can a belief the organism forms about its own body persist within a life and shape a later, consequential decision?
Hypothesis
If body-learned beliefs are causally active, a single perturbation should be traceable through a belief that is promoted, carried, and then cited in the dying decision.
Method
A single-life trace (Life 43): follow one perturbation through belief promotion, persistence, and the final decision window.
Controls
This is a single-life trace, n=1. It is reported as mechanism-existence only and makes no frequency claim.
Result
Perturbation at tick 30864; belief promoted at tick 30980 (confidence 0.80) and rendered every tick to death at tick 32313. Strict active-belief duration 1,332 ticks; the belief is cited in the decision window, ticks 32308 to 32312.
Alternatives
Coincidental timing (the belief is explicitly cited in the logged dying decision, not merely co-present). Over-reading one life (stated plainly as n=1).
F-04 Inconclusive

Does the organism actively maintain itself? A signal withdrawn on review

Correction (2026-05-30): the closed-loop reading first logged on 2026-05-29 did not survive a deeper audit and is withdrawn. See Result.

Question
Does the organism do work to preserve its own coherence: notice a decayed self-belief and act, on its own, to restore it?
Hypothesis
If self-maintenance is real, then with no copyable instruction in the cue the organism should still close the loop: a decayed belief, a self-elected repair action, a rehearsal increment, and restored confidence.
Method
A single life under aggressive, engineered belief decay (configured to decay within about one tick), with the maintenance affordance enabled and no copyable instruction handed to the model. An earlier six-life run that suggested an enabled-versus-baseline asymmetry was found to be confounded by a copyable instruction in the cue; that confound was removed here.
Controls
The affordance-disabled control was run and produced the same belief-refresh as a trivial repeated-probe latch, which the scorer cannot distinguish from genuine closure. The no-decay baseline was not run. The test is not yet discriminating.
Result
On first analysis the loop appeared to close with no copyable instruction. On review the closed-loop credit did not hold: it rested on a coincidence in the scorer's clock ordering and scores zero under correct temporal semantics. What stands is narrow: the model elected a probe under a non-imperative cue, and one belief was refreshed (rehearsal 0 to 1, effective confidence 0.29 to 0.80), once, before re-decay. The phrase "strict closed self-maintenance loop" is withdrawn. As of 2026-05-30 the closure smoke test scores 0 and is blocked by an engineering launch bug (runtime recovery/readout); that is an engineering blocker, not a scientific negative.
Alternatives
The remaining observation does not separate self-maintenance from a repeated-probe latch. The decay was engineered-aggressive, not spontaneous; n=1; the readout layer is unstable; the refreshed belief content ("critical energy depletion") did not match the actual state (energy 0.72). A scorer with correct temporal semantics, and a control that separates genuine closure from a latch, are the next steps.
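One way to read "correct temporal semantics" is that the four stages named in the Hypothesis must occur in strict causal order, each on a later tick than the one before. The sketch below illustrates that reading only. It is not a reconstruction of the project's scorer, and the event kinds, field names, and ticks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    kind: str
    tick: int


# Stages from the Hypothesis, in the order a closed loop would need them.
REQUIRED_ORDER = (
    "belief_decayed",
    "repair_action",
    "rehearsal_increment",
    "confidence_restored",
)


def closure_score(events: Iterable[Event]) -> int:
    """Return 1 if every stage occurs strictly after the previous one, else 0."""
    events = list(events)
    last_tick = -1
    for kind in REQUIRED_ORDER:
        later = [e.tick for e in events if e.kind == kind and e.tick > last_tick]
        if not later:
            return 0
        last_tick = min(later)  # earliest valid occurrence leaves most room for later stages
    return 1


in_order = [Event(k, t) for k, t in zip(REQUIRED_ORDER, (10, 11, 12, 13))]
same_tick = [Event(k, 10) for k in REQUIRED_ORDER]

print(closure_score(in_order))   # 1
print(closure_score(same_tick))  # 0
```

Under a rule like this, events stamped on the same tick cannot earn closure credit, because the ordering between them is unknown. A scorer that accepted such a coincidence would over-credit in exactly the way the Result describes. It would still not separate genuine closure from a repeated-probe latch, which needs the discriminating control named above.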
F-05 Refuted

An artificial nervous system with no model at the centre cannot sustain open-ended behaviour

Question
Can a hand-built artificial nervous system, with no language model at the centre, sustain open-ended, non-repeating behaviour?
Hypothesis
A sufficiently rich hand-built dynamical substrate would self-organise into open-ended behaviour without a language model (the original Path A bet).
Method
Built and ran the Path A architecture (the artificial nervous system) to long horizons, observing the state trajectory.
Controls
Long-horizon observation of whether the trajectory kept changing or converged.
Result
Refuted, for the architecture as built. The system collapsed into a single deterministic attractor and stopped changing. Path A was retired and the project pivoted to Path B.
Alternatives
Insufficient richness rather than a fundamental limit: possible, but this line was not pursued after the pivot. The negative result stands for the architecture as built.

Retractions

No published finding has been retracted to date. When one is, it will stay on this page with its status set to retracted and a short note on why it fell. A finding that quietly disappears is a finding you cannot trust.

A status is a promise we can be held to.

Note: supported is not proven. It means the result has survived its controls so far, at the sample size stated, and nothing more.
