Findings

Every claim, with the status it earned.

A finding is not a headline. Each one states the question, the hypothesis, the method, its controls, the result, and the alternative explanations that could still be true. A status is assigned, and it can be downgraded. Refuted and retracted findings stay on this page.

11 findings: 5 supported, 2 signal, 1 inconclusive, 3 refuted, 0 retracted

F-11 Signal 08 Jun 2026

A falsifier-sensitive, geometry-retained novelty signal

A signal, not a claim: sparse and single-run, held only on the stricter readout, with the open-ended-evolution gate still failing closed.

Question: Is there novelty in the substrate that depends on its real geometry and resource structure, rather than on noise or generic churn, and does it hold under a longer horizon?
Hypothesis: If the novelty is structure-dependent, a strict geometry-retained readout should fire in the true substrate and stay silent when the structure is shuffled or destroyed, and it should not wash out as the horizon lengthens.
Method: A matched panel over 12 seeds compared the real substrate against a shuffled-structure control and a destroyed-structure control on a strict geometry-retained novelty readout, run at a shorter horizon and again at a longer bounded horizon, with all artifacts verified.
Controls: Two falsifiers at matched volume: shuffled structure and destroyed structure. A broader sustained-novelty readout was reported alongside as a frontier check.
Result: The strict readout rose from 1 of 12 seeds at the shorter horizon to 3 of 12 at the longer horizon, while both the shuffled and destroyed controls stayed at 0 of 12. The signal strengthened with horizon rather than washing out, and broke under shuffled or destroyed structure. Artifacts verified cleanly, with no hash mismatch, no gate weakening, no threshold retuning, and nothing promoted.
Alternatives: The signal is sparse, 3 of 12, and from a single overnight run. A broader sustained-novelty readout still fired across all three arms, so it is too permissive here and is not claim-bearing. The strict open-ended-evolution gate still fails closed. This is evidence of a structure-dependent adaptive trace, not life, autopoiesis, or open-ended evolution.
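
For a rough sense of how sparse 3 of 12 against 0 of 12 is, the counts above can be put through a one-sided Fisher exact test. This is an illustrative sketch, not part of the reported analysis: the function name `fisher_one_sided` is hypothetical, and treating each seed as an independent trial is an assumption.

```python
from math import comb

def fisher_one_sided(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """P(arm A holds at least hits_a of the pooled hits) if the arms do not differ."""
    total_hits = hits_a + hits_b
    total = n_a + n_b
    denom = comb(total, n_a)
    p = 0.0
    # Sum hypergeometric probabilities for outcomes at least as extreme as observed.
    for k in range(hits_a, min(total_hits, n_a) + 1):
        p += comb(total_hits, k) * comb(total - total_hits, n_a - k) / denom
    return p

# Real substrate (3 of 12) against one control arm (0 of 12), counts from the Result.
p_real_vs_control = fisher_one_sided(3, 12, 0, 12)
print(f"{p_real_vs_control:.3f}")  # prints 0.109
```

Under these assumptions the one-sided p is about 0.11 against either control taken alone, which fits reading the result as a signal rather than a claim.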

F-10 Refuted 05 Jun 2026

Burst timing is not yet a defensible adaptive signal

Question: Is the headline live-contrast advantage from stall-coupled structural bursts an adaptive signal, or could matched-volume random churn reproduce it?
Hypothesis: If burst timing is adaptive, bursts fired on live stall state should beat a control that fires the same volume of structural churn at random points.
Method: A matched-random burst control: the same structural burst volume as the stall-coupled condition (1,182 burst children), but fired at random points rather than on live stall state.
Controls: The matched-random control is the falsifier. Both arms were verified, run to the same scale, and checked against the open-ended-evolution gate.
Result: A near-tie: the stall-coupled condition produced 41 live-contrast advances, the matched-random control 42. Both failed the open-ended-evolution gate, both plateaued on complexity, and the control verified cleanly with no claim promoted.
Alternatives: Random churn, matched for volume, reproduced or slightly exceeded the result, so burst timing cannot yet be read as adaptive. The falsifier did its job and stopped a tempting over-read. This strengthens the evidence machinery; it advances no life or open-ended-evolution claim.
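
A matched-volume random control of the kind described above can be sketched as follows. This is a minimal illustration, not the project's code: `matched_random_schedule`, the tick horizon, and the example stall ticks are all hypothetical. Only the idea of holding burst volume fixed while randomizing timing comes from the Method.

```python
import random

def matched_random_schedule(stall_ticks: list[int], horizon: int, seed: int) -> list[int]:
    """Return as many burst ticks as the stall-coupled arm used, drawn uniformly at random."""
    rng = random.Random(seed)
    # Same volume as the stall-coupled arm; only the timing differs.
    return sorted(rng.choices(range(horizon), k=len(stall_ticks)))

# Hypothetical ticks at which a live stall fired a burst.
stall_coupled = [12, 40, 41, 97, 150]
control = matched_random_schedule(stall_coupled, horizon=200, seed=0)
print(len(stall_coupled), len(control))  # prints 5 5
```

Because the two arms differ only in when bursts fire, any advantage that survives the comparison can be attributed to timing rather than to the amount of churn.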

F-08 Supported 04 Jun 2026

A memory-bearing drive expresses richer, heritable behaviour

Question: Can a stateful drive with a bounded memory let the organism express a wider range of distinct behaviours, and is that range inherited rather than imposed by the measurement?
Hypothesis: If the memory adds real behavioural structure, expressible behavioural dimensionality should rise above the previous linear drive, and the lift should track the inherited genome rather than the metric.
Method: Compare a stateful, memory-bearing drive against the previous linear drive on expressible behavioural dimensionality, with a sweep over the number of internal modes and a parent-child inheritance check.
Controls: A check that the lift is genome-driven, not injected by the metric. A mode-count sweep to confirm the number is not a free knob. Parent-child field similarity across inheritance.
Result: Expressible behavioural dimensionality rose from roughly 2 to 5. The lift is genome-driven, not metric-injected, and heritable, with parent and child fields staying highly similar. The sweep showed 5 is not a free knob: adding more modes does not simply inflate it, which also reveals a ceiling in the current setup.
Alternatives: Expressibility is not selectability (see F-09): a wider range of producible behaviours does not mean selection can maintain them. This is one measure with a known ceiling, and it is not life, open-ended evolution, or sustained novelty.

F-09 Refuted 04 Jun 2026

Priced selection did not maintain the richer behaviours

Question: Under priced fitness, where behaviours must pay their way to survive, are the richer dimension-5 behaviours selectable, so selection maintains them?
Hypothesis: If richer expressible behaviour were enough, priced selection should keep the dimension-5 behaviours alive.
Method: Run the selectability checkpoint under priced fitness on the dimension-5 behaviours from F-08, measuring viable dimensionality after selection.
Controls: Priced fitness as the selection pressure; viable dimensionality measured against the expressible dimensionality from F-08.
Result: An honest null. The richer behaviours were not selectable; viable dimensionality collapsed back to 2. The missing piece is not more drive complexity: the ecology presents only about two meaningful resource channels, so selection has only about two things to reward.
Alternatives: This refutes the idea that drive complexity alone is enough; it does not refute the approach. The diagnosis is the ecology, not the drive, and the next experiment widens it to at least five priced resource channels before rerunning this gate.

F-07 Signal 03 Jun 2026

Online weight updates move short-horizon body-world energy (early, cross-model)

A final-energy result, not a survival claim: the alive counts are close (10/12 online vs 9/12 each control), and "alive" and "death" here are strict bookkeeping terms.

Question: Does letting the brain's weights slide online, through small adapter updates during a life, change the body's outcomes, beyond a frozen brain and beyond no brain?
Hypothesis: If online weight updates are load-bearing, an online-adapter arm should beat matched frozen-brain and no-brain arms on body-world energy outcomes, and the changed weights should actually be used in later forward passes.
Method: Matched-control lifecycle runs (online-adapter, frozen-brain, no-brain) over 36-tick lives. Dolphin at n=12; an independent family, Qwen, at n=3, preregistered before running.
Controls: A frozen-brain arm (same brain, no updates) and a no-brain arm, matched. The updates are verified to be used: 432 online adapter updates, 420 of them feeding changed weights into later forward passes.
Result: Dolphin n=12: the online arm's final energy beat the frozen arm in 11/12 seeds and the no-brain arm in 12/12 (mean about +10 vs frozen, +11 vs no-brain; one-sided sign tests p ≈ 0.003 and p ≈ 0.0002). Qwen n=3 reproduced the direction (online ahead in 3/3, mean about +12) but at p = 0.125, insufficient on its own.
Alternatives: This is a final-energy result, not a survival claim, and the cross-family signal is only directional (Qwen n=3, p = 0.125), so it is early evidence, not a settled finding. It weakens, but does not rule out, the explanation that the effect is a single-model quirk, and it shows nothing about viability, homeostasis, biological adaptation, or artificial life. We do not promote "survives better" as a claim, and we hold the statistical and viability claims open.
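
The one-sided sign tests quoted in the Result can be checked directly from the seed counts. The sketch below assumes no tied seeds and a null in which each seed is a fair coin flip; the function name `sign_test_one_sided` is hypothetical.

```python
from math import comb

def sign_test_one_sided(wins: int, n: int) -> float:
    """P(at least `wins` successes in n fair coin flips)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

print(round(sign_test_one_sided(11, 12), 4))  # online vs frozen, Dolphin: 0.0032
print(round(sign_test_one_sided(12, 12), 4))  # online vs no-brain, Dolphin: 0.0002
print(sign_test_one_sided(3, 3))              # online ahead, Qwen: 0.125
```

These match the reported p ≈ 0.003, p ≈ 0.0002, and p = 0.125. With n=3 the smallest attainable one-sided p is 0.125, so the Qwen arm could not have reached conventional significance at that sample size whatever the outcome.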

F-06 Supported 01 Jun 2026

A frozen model's learned prior is load-bearing for a body's survival

Question: With the model demoted to a bounded brain that can only nudge a mortal body's parameters, does it actually do anything, or could noise replace it?
Hypothesis: If the trained model's prior is load-bearing, its choices should improve the body's survival beyond every control, including a brain that proposes perfectly valid adjustments chosen at random.
Method: A survival-recovery task: the body takes a perturbation that kills the no-brain version, with a limited window to recover. Five independent runs per arm, reproduced on a second model family and on a longer task.
Controls: Four, and the result counts only if it beats all of them: no brain at all; a brain emitting invalid signals; a brain with randomized weights (same architecture, no learning); and a valid-random brain emitting valid adjustments chosen at random.
Result: A frozen Qwen model's choices beat all four controls across five runs. A second family (Dolphin, on Llama) reproduced the result independently, beating the valid-random control on 5 of 5 runs (p ≈ 0.006), and the effect held on a longer task. The model's choices, not merely the act of emitting a valid signal, measurably improve survival.
Alternatives: Generalization is unshown: this is one task with two models, not structurally different ones, and the randomized-weights control will be strengthened. This is not life, consciousness, or self-maintenance, and the body does not yet reproduce or evolve under selection. A parser bug that once credited an ambiguous output was caught and binned, and an arm whose improvement vanished after the fix was dropped.

F-01 Supported 25 May 2026

A structural attractor forms in the held cache, and controlled discontinuity breaks it

Question: When a model runs continuously with its KV-cache held as memory, does it settle into a self-reinforcing behavioural lock that is a real structural property of the cache, rather than an artifact of the prompt text?
Hypothesis: If the lock is structural, breaking continuity (resetting the held cache) should break it, a matched run that keeps continuity should stay locked, and a text-only cue with no real cache change should not break it.
Method: Three matched 200-tick arms, n=3 per arm. Measure final-100-tick action entropy and how many runs lock onto a single action. A separate 1000-tick three-arm differential corroborates.
Controls: A no-reset control that keeps continuity, and a SENSE-only confound control that injects the discontinuity as text without changing the cache. Temperature 0, fixed seed.
Result: Discontinuity arm: 0/3 runs locked, mean final-window entropy 1.23, about 123 moves. No-reset control: 3/3 locked, entropy 0.00, about 35 moves. SENSE-only control: 3/3 locked. The text cue alone does not break the lock; the effect requires the real held cache.
Alternatives: Sampling noise (addressed by n=3 per arm and entropy 0.00 in controls). The prompt text driving the change (ruled out by the SENSE-only control). Small-sample fragility remains: n=3 is small, so this is reported as supported, not settled, and is being re-run at larger n.
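
The two readouts used above, final-window action entropy and whether a run has locked onto a single action, can be sketched as follows. This is an illustration under assumptions: the function names and action labels are hypothetical, and entropy is computed in bits, which the page does not specify.

```python
from collections import Counter
from math import log2

def action_entropy(actions: list[str]) -> float:
    """Shannon entropy, in bits, of the action distribution."""
    counts = Counter(actions)
    total = len(actions)
    return sum(((c / total) * log2(total / c) for c in counts.values()), 0.0)

def is_locked(actions: list[str], window: int = 100) -> bool:
    """True if every action in the final window is the same."""
    return len(set(actions[-window:])) == 1

locked_run = ["wait"] * 200                           # hypothetical locked trajectory
varied_run = ["north", "south", "east", "west"] * 50  # hypothetical varied trajectory

print(action_entropy(locked_run[-100:]), is_locked(locked_run))  # prints 0.0 True
print(action_entropy(varied_run[-100:]), is_locked(varied_run))  # prints 2.0 False
```

A locked run scores exactly 0.00 because a single repeated action carries no uncertainty, which is why the control arms report entropy 0.00 alongside 3/3 locked.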

F-02 Supported 25 May 2026

A saved cache restores token-identically across a fresh process

Question: Is the held cache a faithful carrier of a life: does saving and restoring it reproduce the same organism, or does restoration quietly alter behaviour?
Hypothesis: If restore is faithful, a fresh process restoring a saved cache should decode the exact same action tokens as the original process.
Method: A create-process versus restore-process comparison. Temperature 0, fixed seed, 435-token prompt, about 70 MiB cache. Compare the first 8 decoded action token ids.
Controls: Two genuinely separate OS processes, not a same-process reload. Corroborated against a 1000-tick differential in which all 20/20 windows matched.
Result: 8/8 decoded token ids identical; ok: true. Restoration is deterministic under fixed seed and temperature: a saved run restores token-for-token. (Measured under the earlier architecture, when the model was run as the organism; see entry 010.)
Alternatives: Same-process caching masking a difference (ruled out by using separate processes). Coincidence on a short window (the corroborating 1000-tick run matched 20/20).
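
The create-versus-restore comparison reduces to checking that the first decoded token ids agree position by position. A minimal sketch, assuming the two processes each hand back a list of token ids; `compare_restore`, the result keys other than `ok`, and the example ids are hypothetical.

```python
def compare_restore(original_ids: list[int], restored_ids: list[int], n: int = 8) -> dict:
    """Compare the first n decoded token ids from the original and restored processes."""
    a, b = original_ids[:n], restored_ids[:n]
    matched = sum(x == y for x, y in zip(a, b))
    # ok only if both sides produced n tokens and every position agrees.
    ok = len(a) == n and len(b) == n and matched == n
    return {"matched": matched, "compared": n, "ok": ok}

# Hypothetical token ids, for illustration only.
original = [101, 7, 42, 42, 9, 300, 12, 5, 77]
restored = [101, 7, 42, 42, 9, 300, 12, 5, 13]

result = compare_restore(original, restored)
print(result)  # prints {'matched': 8, 'compared': 8, 'ok': True}
```

The check is deliberately strict: a single differing id, or a short output on either side, sets `ok` to false.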

F-03 Supported 23 May 2026

A belief the body learned was load-bearing in the decision that ended a life

Single-life trace, n=1: demonstrates the mechanism exists, not how often it occurs.

Question: Can a belief the organism forms about its own body persist within a life and shape a later, consequential decision?
Hypothesis: If body-learned beliefs are causally active, a single perturbation should be traceable through a belief that is promoted, carried, and then cited in the dying decision.
Method: A single-life trace (Life 43): follow one perturbation through belief promotion, persistence, and the final decision window.
Controls: This is a single-life trace, n=1. It is reported as mechanism-existence only and makes no frequency claim.
Result: Perturbation at tick 30864; belief promoted at tick 30980 (confidence 0.80) and rendered every tick to death at tick 32313. Strict active-belief duration 1,332 ticks; the belief is cited in the decision window, ticks 32308 to 32312.
Alternatives: Coincidental timing (the belief is explicitly cited in the logged dying decision, not merely co-present). Over-reading one life (stated plainly as n=1).

F-04 Inconclusive 30 May 2026

Does the organism actively maintain itself? A signal withdrawn on review

Correction (2026-05-30): the closed-loop reading first logged on 2026-05-29 did not survive a deeper audit and is withdrawn. See Result.

Question: Does the organism do work to preserve its own coherence: notice a decayed self-belief and act, on its own, to restore it?
Hypothesis: If self-maintenance is real, then with no copyable instruction in the cue the organism should still close the loop: a decayed belief, a self-elected repair action, a rehearsal increment, and restored confidence.
Method: A single life under aggressive, engineered belief decay (configured to decay within about one tick), with the maintenance affordance enabled and no copyable instruction handed to the model. An earlier six-life run that suggested an enabled-versus-baseline asymmetry was found to be confounded by a copyable instruction in the cue; that confound was removed here.
Controls: The affordance-disabled control was run and produced the same belief-refresh as a trivial repeated-probe latch, which the scorer cannot distinguish from genuine closure. The no-decay baseline was not run. The test is not yet discriminating.
Result: On first analysis the loop appeared to close with no copyable instruction. On review the closed-loop credit did not hold: it rested on a coincidence in the scorer clock-ordering and scores zero under correct temporal semantics. What stands is narrow: the model elected a probe under a non-imperative cue, and one belief was refreshed (rehearsal 0 to 1, effective confidence 0.29 to 0.80), once, before re-decay. The phrase "strict closed self-maintenance loop" is withdrawn. As of 2026-05-30 the closure smoke test scores 0 and is blocked by an engineering launch bug (runtime recovery/readout), not a scientific negative.
Alternatives: The remaining observation does not separate self-maintenance from a repeated-probe latch. The decay was engineered-aggressive, not spontaneous; n=1; the readout layer is unstable; the refreshed belief content ("critical energy depletion") did not match the actual state (energy 0.72). A scorer with correct temporal semantics, and a control that separates genuine closure from a latch, are the next steps.

F-05 Refuted 20 May 2026

An artificial nervous system with no model at the centre cannot sustain open-ended behaviour

Question: Can a hand-built artificial nervous system, with no language model at the centre, sustain open-ended, non-repeating behaviour?
Hypothesis: A sufficiently rich hand-built dynamical substrate would self-organise into open-ended behaviour without a language model (the original Path A bet).
Method: Built and ran the Path A architecture (the artificial nervous system) to long horizons, observing the state trajectory.
Controls: Long-horizon observation of whether the trajectory kept changing or converged.
Result: Refuted, for the architecture as built. The system collapsed into a single deterministic attractor and stopped changing. Path A was retired and the project pivoted to Path B.
Alternatives: Insufficient richness rather than a fundamental limit: possible, but this line was not pursued after the pivot. The negative result stands for the architecture as built.

Retractions

No published finding has been retracted to date. When one is, it will stay on this page with its status set to retracted and a short note on why it fell. A finding that quietly disappears is a finding you cannot trust.

A status is a promise we can be held to.

Note: supported is not proven. It means the result has survived its controls so far, at the sample size stated, and nothing more.
