Skip to content
Leisure Labs — home

Infrastructure

Infrastructure for Trustworthy Iteration

The fastest lab is not the one that runs the most experiments. It is the one that never has to wonder whether last month's result still holds.

Lena Vogt · Yusuf Demir

There is a particular silence in a research meeting when a result fails to replicate internally — when February's promising number cannot be produced in March, and nobody can say whether the model changed, the data changed, the metric changed, or all three. Every hour after that silence is spent on archaeology rather than research. Multiply by a field, and the cost of untrustworthy iteration dwarfs the cost of compute.

We made a decision early: every experiment at the lab must be reproducible by a single command, by default, including the experiments nobody thinks will matter. This is a statement about infrastructure, because researchers will not maintain reproducibility by discipline alone — the system has to make the reproducible path the lazy path.

What that takes in practice

  • Append-only data with versioned views. Datasets are never edited in place; corrections are new versions with explicit lineage. Any result can name the exact data it saw, forever.
  • Evaluations as pinned artifacts. An evaluation is code plus data plus configuration, content-addressed together. "Score went up" is meaningless unless the yardstick is provably the same yardstick.
  • Experiment manifests, written by machines. Every run emits a manifest of inputs, code revision, environment, and outputs. Humans never transcribe hashes; humans are terrible at transcribing hashes.
  • Audit trails as a side effect. Because the above exist, the question "why do we believe this?" has a mechanical answer for any number the lab has ever produced.

Trust is a speed technology

The surprise was not that this infrastructure prevented errors — that was the design goal. The surprise was the change in pace. Researchers build faster on results they trust completely, the way builders work faster on foundations they have not been asked to doubt. Internal review compressed because reviewers verify mechanically instead of socially. New collaborators became productive in days, because the lab's entire empirical history is legible from its artifacts.

We describe this work because it is the least transferable kind of progress: nobody can use our trust. But the pattern transfers, and the pattern is the same one we argue for everywhere — the systems around the work should absorb the friction, so that human attention is spent on the parts only humans can do.

More research

All research →
Position

Machines That Return Time

AI is evaluated by whether it completes tasks. We argue for a complementary standard: whether it expands the time and capacity people have for judgment, learning, care, and shared work.

Amara Osei, Daniel Reyes, June Park

Interfaces

Interfaces for Collaborative Intelligence

Most AI interfaces assume one person and one model. Most meaningful work happens in groups. Design patterns from a year of building shared-context systems for small teams.

June Park, Tomás Carvalho

Evaluation

Measuring Human Capacity, Not Just Model Capability

A benchmark tells you what the model can do. It tells you nothing about what its users can still do. First results from paired capability–capacity evaluations.

Daniel Reyes, Priya Raghavan