Leisure Labs — home

Position

Machines That Return Time

Task completion is the wrong finish line. A system has done its job when the person on the other side of it has more time, and more capability, than they had before.

Amara Osei · Daniel Reyes · June Park

A quiet, naturally lit research room with a large worktable, notebooks, simple machines, and diagrams pinned to the wall.

Abstract

Nearly every published evaluation of an AI system asks some version of the same question: did the machine do the task? Accuracy, pass rates, resolution time, tokens per dollar. These are necessary measures and badly incomplete ones. They describe the machine and say nothing about the person. In this piece we propose a complementary standard — returned time — and a discipline for measuring it: account for the hours a system gives back, observe what those hours become, and track whether the people using the system grow more capable or less. We then trace what this standard implies for the design of interfaces, evaluations, and infrastructure.

The narrow view of automation

The narrow view treats automation as substitution: a task moves from a person to a machine, and the gain is the difference in cost. Under this view, the ideal system is one that removes the person entirely. The narrow view is not wrong about payroll. It is wrong about everything else, because it assumes the value of a person's hour is the task that filled it.

A century of labor-saving technology tells a more complicated story. Mechanization did shorten the working day, from twelve hours to ten to eight, but at every step the gain had to be deliberately claimed, argued for, and designed into institutions; it was never delivered automatically by the machines themselves.[1] Where the gains were not claimed, the freed hours were simply refilled: with more output, more coordination, more process. The machine saved labor and the person saw none of it.

Software has repeated this pattern at higher frequency. Email reduced the cost of sending a message and increased the number of messages until communication consumed more of the day than letters ever had. Much of what is now sold as productivity tooling manages friction that earlier productivity tooling created. If AI systems are built under the narrow view, there is no reason to expect a different outcome: tasks will be completed faster, and the people around them will be no freer than before.

Time as a human capacity

The alternative starts from a different premise: discretionary time is not a byproduct of efficiency but a capacity in its own right — arguably the root capacity, the one that all the others are funded by. Learning takes unhurried hours. Judgment ripens on time spent with a problem. Care, in any meaningful sense, cannot be batched. A person with no claim on their own attention cannot develop, whatever tools they own.

For most of history this capacity was scarce and held by very few. The time in which a small class read, argued, experimented, and governed was purchased with the unrelieved labor of everyone else. The deep promise of labor-saving machinery, visible to its sharpest observers from the beginning, was not cheaper goods. It was the wide distribution of unforced time, and with it the wide distribution of development itself.[2]

An assistant that saves an hour has not finished its job. The question is what the hour becomes.

Measuring what people are free to do

If returned time is the standard, it has to be measured, or it will remain a slogan. We think the measurement decomposes into three questions, each tractable with existing methods.

  • Was time actually returned? Not modeled time — observed time. Instrumented studies of real work, before and after a system is introduced, including the new work the system itself creates: reviewing, correcting, prompting, coordinating around it.
  • Where did the time go? Time-use research has mature instruments for exactly this question.[3] If the hours saved by a drafting assistant are absorbed by a longer review queue, the system has redistributed labor, not reduced it.
  • Did capability grow or shrink? Measure the person, not just the output: skill retention, judgment quality, and confidence on related tasks performed without the system, sampled over months rather than sessions.

None of these measurements is exotic. What is missing is the habit of treating them as part of an AI evaluation at all — as load-bearing as a benchmark score, and as disqualifying when they fail.
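As a minimal sketch of how the three questions might be turned into numbers, consider the accounting below. Every name in it (`ObservedWeek`, `returned_hours`, `discretionary_hours`, `capability_change`) and every figure is a hypothetical chosen for illustration, not an established instrument or a result from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedWeek:
    """One person's observed (not modeled) week. All fields are hypothetical."""
    task_hours: float      # hours spent on the work itself
    overhead_hours: float  # reviewing, correcting, prompting, coordinating around the system
    unaided_score: float   # score on a related task done without the system, 0 to 1


def returned_hours(before: ObservedWeek, after: ObservedWeek) -> float:
    """Question 1: hours given back, net of the new work the system creates."""
    total_before = before.task_hours + before.overhead_hours
    total_after = after.task_hours + after.overhead_hours
    return total_before - total_after


def discretionary_hours(returned: float, reabsorbed_hours: float) -> float:
    """Question 2: what remains once hours refilled by other work are subtracted."""
    return returned - reabsorbed_hours


def capability_change(before: ObservedWeek, after: ObservedWeek) -> float:
    """Question 3: change in unaided performance; negative means skill was lost."""
    return after.unaided_score - before.unaided_score


# Made-up numbers for illustration only.
baseline = ObservedWeek(task_hours=10.0, overhead_hours=0.0, unaided_score=0.75)
with_system = ObservedWeek(task_hours=4.0, overhead_hours=3.0, unaided_score=0.5)

returned = returned_hours(baseline, with_system)            # 3.0
kept = discretionary_hours(returned, reabsorbed_hours=2.0)  # 1.0
skill_delta = capability_change(baseline, with_system)      # -0.25
```

In this invented case the system cuts six hours of task work, yet three hours of new overhead and two reabsorbed hours leave a single discretionary hour, and unaided skill falls. A task-completion metric would report a win; the three measurements together would not.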

Keeping people in the loop, on purpose

Human-in-the-loop design is usually framed as a safety concession: the model is not reliable enough, so a person checks it. This framing treats the person as scaffolding, to be removed when the model improves. We propose the opposite framing for a large class of systems: the person is the point, and the design question is which loop they should be in.

The automation literature documented the failure mode forty years ago: remove people from routine work and their skills decay, precisely the skills the remaining judgment calls require.[4] The answer is not to keep people doing routine work. It is to design systems that route judgment to people and repetition to machines, and that keep the person's model of the problem intact: visible reasoning, contestable intermediate steps, summaries that teach rather than conceal.

A useful test: after six months with the system, could the person do the work better without it than they could have before they met it? Tools that pass this test exist; the spreadsheet arguably did more for numerical literacy than any curriculum. Tools that fail it produce dependence dressed as productivity.[5]

Evaluation beyond task completion

Concretely, we are building evaluations that sit alongside capability benchmarks rather than replacing them. Where a benchmark asks whether the model summarized the document, a capacity evaluation asks: how long did the human spend on document work this week, in total, compared to the instrumented baseline? Did they understand the corpus better or worse than the cohort working without assistance? When the system was removed for a session, had their unaided performance risen or fallen?
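One hedged way to read the withdrawal question is as a difference-in-differences comparison: measure each person's unaided score at baseline and again in a session with the system removed, then compare the average change in the assisted cohort against the unassisted one. The sketch below assumes that framing; the function names and all scores are hypothetical, and nothing here is claimed to be the lab's actual protocol.

```python
from statistics import mean


def mean_unaided_change(baseline: list[float], withdrawal: list[float]) -> float:
    """Average per-person change in unaided score; positive means performance rose."""
    if len(baseline) != len(withdrawal) or not baseline:
        raise ValueError("need one baseline and one withdrawal score per person")
    return mean([w - b for b, w in zip(baseline, withdrawal)])


def capacity_effect(assisted_change: float, control_change: float) -> float:
    """Difference-in-differences: change in the assisted cohort minus the control cohort."""
    return assisted_change - control_change


# Made-up scores (0 to 1) for two people per cohort, for illustration only.
assisted = mean_unaided_change([0.5, 0.75], [0.25, 0.5])  # -0.25
control = mean_unaided_change([0.5, 0.5], [0.5, 0.75])    # 0.125
effect = capacity_effect(assisted, control)               # -0.375
```

Subtracting the control cohort's change separates the system's effect from whatever would have happened anyway, such as ordinary practice gains over the study period. A real study would also need enough participants and a significance test, both of which this sketch omits.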

These evaluations are slower than benchmarks, involve real people, and resist leaderboards. We consider this a feature. A field that measures only what is fast to measure will optimize its systems toward exactly those measures, and the slow variables — skill, judgment, the shape of a working day — will drift unobserved.

Implications for interfaces and infrastructure

The standard reaches below evaluation into design. Interfaces: a system built to return time exposes its reasoning so the person learns the domain, not just the output, and it prefers finishing a delegated task over maximizing engagement. An assistant whose success metric is daily active minutes is pointed in the wrong direction by construction. Infrastructure: much of the unnecessary work in any institution is friction between systems, such as retrieval, translation, reconciliation, and status reporting. Shared protocols and interoperable knowledge remove that work for everyone at once, which is why connective infrastructure, unglamorous as it is, may return more aggregate time than any single assistant.

Conclusion

The narrow view of automation will be realized by default; it requires nothing but momentum. The wider outcome — machines that return time, and people more capable for having used them — has to be designed for, measured, and defended, the way the eight-hour day was. That is an engineering agenda, not a sentiment. It is the agenda this lab exists to pursue.

Notes

  1. The shorter-hours movement predates almost all modern labor legislation, and its gains tracked organized claims on productivity, not productivity alone. For the long arc, see Benjamin Hunnicutt, Work Without End (Temple University Press, 1988).
  2. John Maynard Keynes, "Economic Possibilities for our Grandchildren" (1930), projected that technical progress would make the fifteen-hour week feasible within a century, and observed that the harder problem would be learning to use the freed time well.
  3. See Jonathan Gershuny, Changing Times: Work and Leisure in Postindustrial Society (Oxford University Press, 2000), on diary-based time-use measurement.
  4. Lisanne Bainbridge, "Ironies of Automation," Automatica 19, no. 6 (1983): 775–779.
  5. The distinction between tools that enlarge personal competence and tools that create dependence is developed in Ivan Illich, Tools for Conviviality (Harper & Row, 1973).
