Evaluation
Measuring Human Capacity, Not Just Model Capability
We ran the same study twice — once measuring the model, once measuring the people six weeks later. The two leaderboards do not agree, and the difference is the finding.
Daniel Reyes · Priya Raghavan
Model evaluation has converged on a comfortable shape: a fixed task set, a score, a leaderboard. The shape is comfortable because everything in it is fast and machine-legible. But the deployed reality of an AI system is a coupled human–machine system, and we have been measuring exactly half of it.
This spring we ran a small study with what we believe is an unusual design: a paired evaluation. Eighty volunteer participants did six weeks of structured analytical work (data cleaning, literature synthesis, and statistical interpretation) with one of three assistant configurations, or with none. We benchmarked the assistants on the underlying tasks (the capability axis), and we tested the participants before, during, and after, including sessions where the assistant was withheld (the capacity axis).
The two axes come apart
The headline: the configuration that scored highest on capability produced the largest decline in unaided participant performance — about eleven percent on transfer tasks at week six. It did the work most completely, explained itself least, and left the smallest residue of understanding. The configuration participants learned most from was a deliberately slower one that exposed intermediate steps and asked one verification question per task. It cost roughly four minutes more per session and left participants measurably better at the work itself.
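To make the comparison concrete, here is a minimal sketch of how the two axes might be tabulated side by side. It assumes each participant has an unaided score from before the study and an unaided transfer score at week six. The `Participant` fields, the configuration labels, and every number below are hypothetical placeholders for illustration, not data or instruments from the study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Participant:
    config: str           # assistant configuration label (hypothetical)
    unaided_pre: float    # unaided score before the study
    unaided_week6: float  # unaided transfer score at week six


def relative_change(p: Participant) -> float:
    """Fractional change in unaided performance; negative means decline."""
    return (p.unaided_week6 - p.unaided_pre) / p.unaided_pre


def capacity_by_config(participants):
    """Mean relative change in unaided performance per configuration."""
    groups = {}
    for p in participants:
        groups.setdefault(p.config, []).append(relative_change(p))
    return {config: mean(changes) for config, changes in groups.items()}


# Toy values for illustration only; these are not results from the study.
toy = [
    Participant("A", 80.0, 72.0),
    Participant("A", 60.0, 54.0),
    Participant("B", 80.0, 84.0),
    Participant("B", 50.0, 55.0),
]
capability = {"A": 0.9, "B": 0.7}  # hypothetical benchmark scores
capacity = capacity_by_config(toy)

# Rank configurations on each axis, best first.
by_capability = sorted(capability, key=capability.get, reverse=True)
by_capacity = sorted(capacity, key=capacity.get, reverse=True)
```

With these toy inputs the two rankings come out in opposite order, which is the pattern a single-axis leaderboard cannot show. A real analysis would also need uncertainty estimates and a no-assistant baseline; this sketch omits both.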
We want to be careful about the claim. Eighty people, six weeks, one domain: this is a pilot, not a law of nature. But the direction of the effect was consistent across all three task families, and it matches what the automation literature has reported in cockpit and control-room settings for decades. We see no obvious reason to expect knowledge work to be exempt.
What a capacity benchmark would need
- A withheld-tool condition. Capacity is what remains when the system is taken away. Without unaided testing, deskilling is invisible by construction.
- Time horizons measured in weeks. Session-level studies flatter every system; skill decay and skill growth both need time to show.
- Transfer tasks. The question is not whether people remember the workflow, but whether they got better at the domain.
- Honest cost accounting. The enskilling configuration was slower. A field that only rewards speed will never build it.
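The four requirements above can be read as a checklist against a study design. The sketch below is one hypothetical way to encode that checklist. The `StudyDesign` fields and the `min_weeks` threshold are our own assumptions for illustration; they are not part of the released protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyDesign:
    has_withheld_tool_condition: bool  # participants tested without the assistant
    duration_weeks: float              # length of the study
    has_transfer_tasks: bool           # tasks that test the domain, not the workflow
    reports_time_cost: bool            # time cost per session is reported


def missing_requirements(design: StudyDesign, min_weeks: float = 2.0):
    """List which of the four requirements a design fails to meet.

    min_weeks is an arbitrary illustrative threshold, not a recommendation.
    """
    missing = []
    if not design.has_withheld_tool_condition:
        missing.append("withheld-tool condition")
    if design.duration_weeks < min_weeks:
        missing.append("multi-week time horizon")
    if not design.has_transfer_tasks:
        missing.append("transfer tasks")
    if not design.reports_time_cost:
        missing.append("cost accounting")
    return missing


# A single-session study that only reports speed, versus a six-week paired design.
session_study = StudyDesign(False, 0.1, False, True)
paired_study = StudyDesign(True, 6, True, True)

print(missing_requirements(session_study))
print(missing_requirements(paired_study))
```

A checklist like this only says whether a design can detect capacity effects at all; it says nothing about statistical power or the quality of the instruments.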
We are releasing the study protocol and instruments, and we are expanding the design to two further domains this year. Our position is not that capability benchmarks are wrong — we use them daily. It is that a leaderboard with one axis is a steering wheel that only turns one way.