Multi-agent systems theory is roughly forty-five years old. It has formal definitions of what an agent is, protocols for who does what, a complexity result that says coordination is provably hard, and an economics for making selfish processes behave. The 2023+ LLM agent stack is re-deriving nearly all of it — usually without the citations. A field guide, with every concept illustrated twice: once from the literature, once from the running code of our agent holding.
Multi-agent systems — MAS, in the literature — sits at the intersection of distributed AI, game theory, economics and control theory. It studies how multiple autonomous decision-makers, each with local information and its own goals, produce coherent (or incoherent) global behavior. The field’s canon was largely written between 1980 and 2002: the Contract Net Protocol and the Hearsay-II blackboard both date to 1980, Brooks’ subsumption architecture to 1986, the BDI formalism to 1991, the standard definition of “agent” to 1995, and the complexity result that explains why all of this is genuinely difficult to 2002.
Then, starting in 2023, a much larger engineering community began building multi-agent systems with LLMs as the reasoning core — orchestrators, role-playing software companies, debate panels, agent protocols — and started re-deriving the canon from scratch, mechanism by mechanism, mostly without naming it.
We have a convenient specimen for checking that claim. H0LD1NG — the system from our previous field notes — is an operating system for companies staffed by AI agents: org charts in YAML, a generic runtime, a task board, memos, budgets, an event bus, and a meta-agent that designs new companies. It was built bottom-up from engineering needs, not from the MAS literature. Which makes the convergence the interesting part: nearly every guard, channel and rail in its source has a name in a paper written before 2003. This article walks the theory in order, and at each stop shows the line of code that reinvented it.
The field’s standard definition comes from Wooldridge & Jennings (1995), who deliberately offer a weak notion of agency built on four properties: autonomy (“agents operate without the direct intervention of humans”), social ability (“agents interact with other agents … via some kind of agent-communication language”), reactivity (perceive the environment and “respond in a timely fashion”), and pro-activeness (“goal-directed behaviour by taking the initiative”). A stronger notion, they note, additionally describes agents with mentalistic vocabulary — knowledge, belief, intention, obligation — the ground later formalized as BDI.
A system becomes multi-agent when there are several of these, no global
controller scripting each move, and outcomes that emerge from interaction rather
than from any single agent’s plan. Run the checklist against one H0LD1NG persona —
say a content-writer at the mock SEO agency. Autonomy: each wake-up is a sealed LLM
session that decides for itself what the board state demands; no human in the loop.
Reactivity: event triggers (“wake when task.created with
role: engineer”). Pro-activeness: cadences (daily@09:30) —
the agent acts on its own clock, not just on stimuli. Social ability: the task board
and memo channel are the only way personas touch each other. Four for four — and the
runtime dispatching them holds no script of what happens next, only physics:
schedules, budgets, guards.
Classical agent design has two poles and a middle. The deliberative pole rests on Newell & Simon’s physical-symbol-system hypothesis: the agent “contains an explicitly represented, symbolic model of the world” and decides “via logical (or at least pseudo-logical) reasoning.” The reactive pole is Brooks’ subsumption architecture (1986): no central world model, no symbolic reasoning — “a hierarchy of task-accomplishing behaviours” where lower layers take precedence. The middle is BDI (Rao & Georgeff 1991), which made intention a first-class mental attitude with “equal status with the notions of belief and desire,” formalized in branching-time logic and implemented in systems like the Procedural Reasoning System.
Modern LLM agents are deliberative-hybrids, and H0LD1NG makes the layering unusually
legible because it implements the loop twice: once delegated to the Claude Agent SDK,
and once hand-rolled for OpenAI-compatible providers — model call → parse tool calls →
execute → feed results back → repeat (apiRunner.ts, whose header notes
“What the SDK gives Claude for free, this loop provides itself”). The LLM is the
deliberative core; the tool dispatch and file sandbox are the reactive layer that
actually touches the world.
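The loop named above (model call, parse tool calls, execute, feed results back, repeat) fits in a few lines. What follows is a minimal sketch, not the contents of apiRunner.ts: every type and function name is hypothetical, the model is a synchronous scripted stand-in rather than a real provider call, and the turn cap is passed in by the caller.

```typescript
// Hypothetical types for illustration; none of these names come from apiRunner.ts.
interface ToolCall { tool: string; args: Record<string, string>; }
interface ModelReply { text: string; toolCalls: ToolCall[]; }
type LoopMessage = { from: "user" | "assistant" | "tool"; content: string };
type ModelFn = (transcript: LoopMessage[]) => ModelReply; // synchronous for brevity
type ToolFn = (args: Record<string, string>) => string;

function runAgentLoop(
  model: ModelFn,
  tools: Record<string, ToolFn>,
  task: string,
  maxTurns: number,
): { transcript: LoopMessage[]; turns: number; finished: boolean } {
  const transcript: LoopMessage[] = [{ from: "user", content: task }];
  for (let turn = 1; turn <= maxTurns; turn++) {
    const reply = model(transcript); // 1. model call
    transcript.push({ from: "assistant", content: reply.text });
    if (reply.toolCalls.length === 0) {
      // No tool calls requested: the model considers the run complete.
      return { transcript, turns: turn, finished: true };
    }
    reply.toolCalls.forEach((call) => {
      // 2. parse the requested call, 3. execute it (unknown tools become errors)
      const impl = tools[call.tool];
      const result = impl ? impl(call.args) : `error: unknown tool ${call.tool}`;
      // 4. feed the result back so the next model call can see it
      transcript.push({ from: "tool", content: result });
    });
  }
  return { transcript, turns: maxTurns, finished: false }; // turn cap reached
}

// Usage with a scripted stand-in for the LLM: one tool call, then a final answer.
const scriptedModel: ModelFn = (transcript) =>
  transcript.some((m) => m.from === "tool")
    ? { text: "done", toolCalls: [] }
    : { text: "reading the board", toolCalls: [{ tool: "board_read", args: {} }] };

const loopResult = runAgentLoop(
  scriptedModel,
  { board_read: () => "2 tasks in backlog" },
  "wake",
  40,
);
// loopResult.turns is 2 and loopResult.finished is true
```

The deliberative part is whatever sits behind `model`; everything else in the function is the reactive shell that touches the world on its behalf.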
The BDI mapping is the satisfying one, with a twist the 1991 paper didn’t anticipate:
the mental state lives outside the agent. Beliefs are the
board, memos and files — re-read from disk at every wake-up, because the runner sets
persistSession: false and the system prompt warns that “anything not
written to the board, a memo, or a file is lost.” Desires are the
mission and KPIs in the company YAML. Intentions are claimed tickets:
moving a task to in_progress stamps claimedBy with your
persona key, and a rule set (computeClaim) releases the commitment if the
work is bounced, reassigned, or its owner crashes. An agent here is a memoryless
process; the organization is the mind.
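A claim rule set of this shape can be sketched as a pure function over the ticket. This is an illustration under our own assumptions, not the real computeClaim: the change types are hypothetical, and we assume that a "bounce" means a move back to backlog and that moves to review or done leave the claim in place.

```typescript
type BoardStatus = "backlog" | "in_progress" | "review" | "done";

interface BoardTask {
  id: string;
  status: BoardStatus;
  role: string;
  claimedBy: string | null; // persona key of whoever holds the commitment
}

// Hypothetical change events; the real rule set may distinguish more cases.
type ClaimChange =
  | { type: "status"; by: string; to: BoardStatus }
  | { type: "reassign"; toRole: string }
  | { type: "crash"; persona: string };

function computeClaimSketch(task: BoardTask, change: ClaimChange): BoardTask {
  if (change.type === "status") {
    if (change.to === "in_progress") {
      // Starting work is the commitment: stamp the persona that moved the ticket.
      return { ...task, status: change.to, claimedBy: change.by };
    }
    if (change.to === "backlog") {
      // Assumed meaning of "bounced": back to backlog releases the claim.
      return { ...task, status: change.to, claimedBy: null };
    }
    return { ...task, status: change.to }; // review / done keep the claim (assumption)
  }
  if (change.type === "reassign") {
    return { ...task, role: change.toRole, claimedBy: null };
  }
  // Crash: release only if the crashed persona actually held the claim.
  return task.claimedBy === change.persona ? { ...task, claimedBy: null } : task;
}

const freshTask: BoardTask = { id: "T-1", status: "backlog", role: "content-writer", claimedBy: null };
const claimedTask = computeClaimSketch(freshTask, { type: "status", by: "writer-1", to: "in_progress" });
const afterOtherCrash = computeClaimSketch(claimedTask, { type: "crash", persona: "writer-2" });
const afterOwnerCrash = computeClaimSketch(claimedTask, { type: "crash", persona: "writer-1" });
// claimedTask.claimedBy is "writer-1"; only the owner's crash releases it
```

Keeping the rule pure (ticket in, ticket out) is what lets the intention live on disk rather than in any session.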
The founding mechanism of MAS task allocation is Reid Smith’s Contract Net Protocol (IEEE Trans. Computers, 1980): managers announce tasks with eligibility specifications; “available contractors evaluate task announcements … and submit bids on those for which they are suited. The managers evaluate the bids and award contracts to the nodes they determine to be most appropriate.” The process recurses — a contractor may partition its task and become a manager for the parts. Forty-five years later this is recognizably the orchestrator-worker pattern, a mapping made explicitly in recent work on agent contracts and LLM task brokering (e.g. arXiv:2504.21030, arXiv:2506.01900).
H0LD1NG’s cross-company outsourcing broker is a contract net with one phase
deliberately amputated. A role calls outsource_task, naming a sibling
company; the orchestrator validates and immediately creates a linked ticket in the
target — announcement and award fused into a directed grant, no bid round, no
comparative evaluation. The target’s own triggers fire on task.created;
on completion the artifacts are copied back into the source’s
inbox/outsourced/ and the source ticket moves to review. Even Smith’s
recursion shows up, inverted: “A task can be outsourced only once, an
outsourced-in task cannot be re-outsourced (no chains)” — contracting depth pinned
to one, so the dependency graph stays flat and the copy-back has a single,
unambiguous home. Update, hours after first publication: the amputation is
now optional. Leave the target unnamed and the broker runs the missing phase by
proxy — every sibling scored (vitality − 2·queue depth − 5·cost-per-done-task, with
a cold-start penalty so an unproven company can’t win on an empty ledger), best bid
wins, candidate table recorded on the ticket. The contract net got its market back.
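The proxy bid round is simple enough to write down. The sketch below uses the weights quoted above (vitality − 2·queue depth − 5·cost-per-done-task); the field names, the vitality scale and the size of the cold-start penalty are our assumptions, since the text gives only the formula.

```typescript
interface Sibling {
  id: string;
  vitality: number;   // company health score (scale assumed)
  queueDepth: number; // open tickets waiting
  spendUsd: number;   // lifetime spend from the ledger
  doneTasks: number;  // completed tickets
}

const COLD_START_PENALTY = 10; // assumed value; the article does not state it

function bidScore(c: Sibling): number {
  const proven = c.doneTasks > 0;
  // An unproven company has no cost history, so it is charged the penalty instead.
  const costPerDone = proven ? c.spendUsd / c.doneTasks : 0;
  const base = c.vitality - 2 * c.queueDepth - 5 * costPerDone;
  return proven ? base : base - COLD_START_PENALTY;
}

function pickContractor(siblings: Sibling[]): {
  winner: string | null;
  table: { id: string; score: number }[];
} {
  // The sorted table is what would be recorded on the ticket.
  const table = siblings
    .map((c) => ({ id: c.id, score: bidScore(c) }))
    .sort((a, b) => b.score - a.score);
  const best = table[0];
  return { winner: best ? best.id : null, table };
}

const siblingCompanies: Sibling[] = [
  { id: "seo-agency", vitality: 8, queueDepth: 1, spendUsd: 5, doneTasks: 10 },     // 8 - 2 - 2.5 = 3.5
  { id: "software-house", vitality: 9, queueDepth: 3, spendUsd: 6, doneTasks: 12 }, // 9 - 6 - 2.5 = 0.5
  { id: "newco", vitality: 9, queueDepth: 0, spendUsd: 0, doneTasks: 0 },           // 9 - 0 - 0 - 10 = -1
];
const brokerDecision = pickContractor(siblingCompanies);
// brokerDecision.winner is "seo-agency"
```

Without the penalty, the empty-ledger company would score 9 and win on no evidence at all, which is the failure the cold-start term exists to prevent.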
Inside a company, allocation is capacity-bounded rather than negotiated: each role
materializes one slot per seat (clamped 1..8), dispatch marks a slot busy up front so
concurrent passes can’t double-book it, and a missed cadence while all seats are full
is dropped — not queued — with run.skipped{busy}, because
periodic work is idempotent over board state and stale fires would only be waste.
Event triggers do queue, FIFO per role, capped at 50 with drop-oldest: a bounded
mailbox that sheds the stalest obligations to stay live.
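Both bounds in this paragraph are a few lines each. A minimal sketch, assuming hypothetical class and function names: seats clamp to 1..8, a cadence that fires while every seat is busy is skipped rather than queued, and the per-role event mailbox is FIFO with a cap of 50 and drop-oldest shedding.

```typescript
const MAILBOX_CAP = 50;

// FIFO mailbox that sheds its oldest entry when it overflows.
class RoleMailbox<T> {
  private queue: T[] = [];
  private readonly cap: number;

  constructor(cap: number = MAILBOX_CAP) {
    this.cap = cap;
  }

  // Enqueue a trigger; returns the entry that was dropped, if any.
  push(item: T): T | undefined {
    this.queue.push(item);
    return this.queue.length > this.cap ? this.queue.shift() : undefined;
  }

  // Serve the oldest surviving trigger.
  next(): T | undefined {
    return this.queue.shift();
  }

  size(): number {
    return this.queue.length;
  }
}

// One slot per seat, clamped to the 1..8 range.
function seatCount(requested: number): number {
  return Math.min(8, Math.max(1, Math.floor(requested)));
}

// Cadences are never queued: if no seat is free, the fire is skipped.
function onCadenceFire(busySeats: number, seats: number): "dispatch" | "skip-busy" {
  return busySeats < seats ? "dispatch" : "skip-busy";
}

const engineerMailbox = new RoleMailbox<number>();
let firstDropped: number | undefined = undefined;
for (let i = 0; i <= 50; i++) {
  const dropped = engineerMailbox.push(i); // 51 pushes into a 50-entry mailbox
  if (dropped !== undefined) firstDropped = dropped;
}
// firstDropped is 0 (the oldest trigger), and the mailbox still holds 50 entries
```

The asymmetry is the design choice: a cadence can be dropped because the next one will re-read the same board, while an event carries information that may not recur, so it queues until the cap forces a choice.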
The same year as the contract net, the Hearsay-II speech-understanding system (Erman, Hayes-Roth, Lesser & Reddy, 1980) gave MAS its other great coordination pattern: independent knowledge sources that never call each other, communicating solely through a shared global database. “KSs communicate through a global database called the blackboard … It represents intermediate states of problem-solving activity, and it communicates messages (hypotheses) from one KS that activate other KSs.” Replace “knowledge source” with “role” and “hypothesis” with “ticket” and you have described a 2026 agent task board — a lineage modern blackboard revivals now cite directly (arXiv:2507.01701, arXiv:2510.01285).
H0LD1NG is a blackboard system to the letter. Roles never message each other; the system prompt is blunt about it:
“Coordinate with other roles ONLY through the task board and memos (mcp__company__* tools). Other roles cannot see your session — anything not written to the board, a memo, or a file is lost.” — role system prompt, packages/core/src/runner/promptBuilder.ts
The board’s status lifecycle (backlog → in_progress → review → done) is hypothesis
refinement with a credibility gate: work may not move to done until a reviewer —
just another role subscribed to status: review — passes it. The
workspace paths are conventions in the Lewis sense, an arbitrary-but-shared
coordination equilibrium: the SEO agency’s shared context mandates
research/<client>/, briefs/<client>/,
notes/<role-id>.md, so researchers and writers hand off artifacts
without ever addressing one another.
Shoham & Tennenholtz (AAAI 1992; AIJ 1995) proposed designing social laws offline — “constraints on the behavior of agents” specifying “which of the actions that are in general available are in fact allowed in a given state” — so that coordination needs neither a central arbiter nor negotiation. Their canonical example is traffic: adopt keep-to-the-right, and all head-on collisions are avoided “without any need for either a central arbiter or negotiation.” H0LD1NG ships two such laws in the engine, where they bind regardless of what any model thinks: an event produced by a role’s own run never re-triggers that role (one line in the dispatcher), and a sliding-window guard caps re-triggers of the same (role, subject) at six per ten minutes. The source comment names the exact failure it outlaws:
“The self-trigger guard only blocks a role from re-triggering itself; it does nothing for cross-role ping-pong (e.g. a reviewer bouncing a task back to a maker who re-submits it for review, forever). … Beyond the cap we drop the trigger and emit run.skipped so a runaway loop dampens itself instead of burning spend.” — packages/core/src/orchestrator.ts
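The two engine-level laws compose into one admission check. This is a sketch under stated assumptions, not the code in orchestrator.ts: the class name, method signature and in-memory bookkeeping are ours, while the limits (six per ten minutes, keyed by role and subject) are the ones quoted above.

```typescript
const WINDOW_MS = 10 * 60 * 1000; // ten-minute sliding window
const MAX_RETRIGGERS = 6;         // admitted triggers per (role, subject) per window

class LoopGuardSketch {
  // Timestamps of admitted triggers, keyed by target role and subject.
  private fires: Record<string, number[]> = {};

  // Returns true if the trigger may dispatch, false if it must be dropped.
  admit(producerRole: string, targetRole: string, subject: string, nowMs: number): boolean {
    // Law 1: an event produced by a role's own run never re-triggers that role.
    if (producerRole === targetRole) return false;

    // Law 2: cap re-triggers of the same (role, subject) inside the window.
    const key = targetRole + "::" + subject;
    const recent = (this.fires[key] || []).filter((t) => nowMs - t < WINDOW_MS);
    if (recent.length >= MAX_RETRIGGERS) {
      this.fires[key] = recent;
      return false; // the caller would emit run.skipped here
    }
    recent.push(nowMs);
    this.fires[key] = recent;
    return true;
  }
}

// A reviewer bouncing the same task to a maker once a minute.
const loopGuard = new LoopGuardSketch();
const oneMinute = 60 * 1000;
const verdicts: boolean[] = [];
for (let i = 0; i < 7; i++) {
  verdicts.push(loopGuard.admit("reviewer", "maker", "task-17", i * oneMinute));
}
// verdicts: six admissions, then the seventh bounce is dropped
```

Neither check consults a model, which is the point: the pair does not need to notice its own cycle for the cycle to stop.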
A third class of law lives only in the prompts — the SEO auditor’s “Never approve your own work”, the software house’s “Never assign two engineers overlapping files at the same moment — slice by feature, not layer”, Trading Glass’s HARD CONSTRAINT against ever touching live trading systems, repeated at the org, role and KPI layers. The distinction matters and the theory predicted it: an engine-enforced law is a guarantee; a prompt-level norm is a request. H0LD1NG’s designers put the loop-termination laws in code and the etiquette in prose — which is exactly where each belongs.
Update, hours after first publication: status memos are now typed inform (§06) and wake nobody, and a pair that keeps tripping the guard escalates into a six-hour mute plus an executive memo (§10).
When agents are self-interested (or merely fallible), MAS borrows its formal backbone from economics. Outcomes are analyzed as games (Nash equilibria, with the prisoner’s dilemma as the canonical gap between individual and collective rationality), and mechanism design runs the analysis backward: you design the rules so that even selfish play produces an acceptable global outcome. The textbook results are auctions. In a second-price sealed-bid auction (the Vickrey auction, 1961) “truth telling is a dominant strategy” (Shoham & Leyton-Brown, Thm. 11.1.1), and its generalization, the VCG mechanism, makes “each agent … pay his social cost — the aggregate impact that his participation has on other agents’ utilities.” The deep idea is not auctions; it’s that the rules, not the agents, carry the burden of good behavior.
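The dominant-strategy claim is easy to spot-check numerically. The sketch below implements a second-price sealed-bid auction and compares a truthful bid against every integer deviation for a few rival profiles; it is a sanity check on a small grid with made-up values, not a proof, and all names are ours.

```typescript
interface SealedBid { bidder: string; amount: number; }

// Second-price sealed-bid: highest bid wins, winner pays the second-highest bid.
function vickrey(bids: SealedBid[]): { winner: string; price: number } {
  const ranked = bids.slice().sort((a, b) => b.amount - a.amount);
  const first = ranked[0];
  const second = ranked[1];
  if (!first) throw new Error("no bids");
  return { winner: first.bidder, price: second ? second.amount : 0 };
}

// Utility of bidder "me" with private value `value` when bidding `myBid`.
function utility(value: number, myBid: number, rivalBids: number[]): number {
  const mine: SealedBid[] = [{ bidder: "me", amount: myBid }];
  const bids = mine.concat(rivalBids.map((amount, i) => ({ bidder: "rival-" + i, amount })));
  const outcome = vickrey(bids);
  return outcome.winner === "me" ? value - outcome.price : 0;
}

// True if no integer bid in 0..maxBid beats bidding the true value.
function truthIsBestResponse(value: number, rivalBids: number[], maxBid: number): boolean {
  const truthful = utility(value, value, rivalBids);
  for (let b = 0; b <= maxBid; b++) {
    if (utility(value, b, rivalBids) > truthful) return false;
  }
  return true;
}

const dominanceHolds =
  truthIsBestResponse(7, [5, 3], 12) && // truthful bid wins and pays 5
  truthIsBestResponse(7, [5, 9], 12) && // truthful bid loses and pays nothing
  truthIsBestResponse(7, [7, 7], 12);   // tie at the private value
// dominanceHolds is true
```

The reason it works is visible in `vickrey`: the winner's own bid never appears in the price, so shading or inflating it can change only whether you win, never what you pay.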
LLM agents are not strategic bidders, but they are budget-burning processes with
unreliable judgment, which is close enough for the institution-design lens to bite.
H0LD1NG’s money rails read like a mechanism-design checklist. Per-run caps bound any
single agent’s deliberation (40 turns, $1.00 — bounded rationality as a config key).
The daily company budget is a hard ceiling under concurrency: every
dispatched run reserves its worst-case cost up front, and the guard admits new work
against settled-plus-reserved spend — otherwise, as the source puts it, “one
dispatch pass admits every free slot against the same stale settled total and
overshoots by (concurrent_slots × maxBudgetUsd).” An unset budget is not unlimited
(“An unset budget must not mean ‘unbounded spend’” — a $25/day backstop applies, and
the breach event is flagged defaulted so you know which rule fired).
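The reservation rule is what makes the daily cap hold under concurrency. A minimal sketch, assuming hypothetical names and an in-memory ledger: each dispatch reserves its worst case up front, admission is checked against settled-plus-reserved spend, and an unset budget falls back to the $25 backstop with a `defaulted` flag.

```typescript
const DEFAULT_DAILY_BUDGET_USD = 25; // backstop applied when no budget is configured

class DailyBudgetGuard {
  private settledUsd = 0;  // actual cost of finished runs
  private reservedUsd = 0; // worst-case cost of runs still in flight
  readonly limitUsd: number;
  readonly defaulted: boolean;

  constructor(configuredLimitUsd?: number) {
    this.defaulted = configuredLimitUsd === undefined;
    this.limitUsd = configuredLimitUsd === undefined ? DEFAULT_DAILY_BUDGET_USD : configuredLimitUsd;
  }

  // Admit a run only if its worst case fits alongside everything already committed.
  tryReserve(maxBudgetUsd: number): boolean {
    if (this.settledUsd + this.reservedUsd + maxBudgetUsd > this.limitUsd) return false;
    this.reservedUsd += maxBudgetUsd;
    return true;
  }

  // On completion, swap the reservation for what the run actually cost.
  settle(maxBudgetUsd: number, actualUsd: number): void {
    this.reservedUsd -= maxBudgetUsd;
    this.settledUsd += actualUsd;
  }

  committedUsd(): number {
    return this.settledUsd + this.reservedUsd;
  }
}

// One dispatch pass: four free slots, $1.00 per-run cap, $2.50 daily budget.
const budgetGuard = new DailyBudgetGuard(2.5);
const admitted = [1, 2, 3, 4].map(() => budgetGuard.tryReserve(1.0));
// admitted: [true, true, false, false]

budgetGuard.settle(1.0, 0.25);                         // first run finishes cheaply
const admittedAfterSettle = budgetGuard.tryReserve(1.0); // true: 0.25 + 1.00 + 1.00 <= 2.50
```

A guard that checked only settled spend would have admitted all four slots against a total of $0 and committed $4.00 against a $2.50 ceiling, which is the overshoot the source comment describes.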
Two further rules show real institutional thinking. A paid model with no entry in the price table fails the run, because “a paid provider with no known price would bill $0, so NEITHER the per-run dollar cap NOR the per-day budget could ever see its real spend … Fail the run rather than run uncapped on a real account” — if you cannot meter an action, you must not let the agent take it, or the whole mechanism is silently defeated. And the ledger that all of this reads from is the one file the store fsyncs (“the ledger is the money + audit record”), surviving even daemon death: interrupted runs are reconciled as failed on the next start so the cap never relaxes across a crash. Above it all sits the Architect — a meta-agent whose single tool validates a proposed company against the same schema the runtime executes. The designer proposes; the institution adjudicates. That is mechanism design applied to organizations themselves.
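The first of those two rules, refuse what you cannot meter, can be sketched as a pricing function that throws instead of returning zero. The price table shape, model names and per-million-token rates below are placeholders invented for illustration, and treating unpaid providers as free is our assumption.

```typescript
interface ModelPrice { inputUsdPerMTok: number; outputUsdPerMTok: number; }

function meterRun(
  prices: Record<string, ModelPrice>,
  model: string,
  paidProvider: boolean,
  inputTokens: number,
  outputTokens: number,
): number {
  const known = Object.prototype.hasOwnProperty.call(prices, model);
  if (!known) {
    if (paidProvider) {
      // Billing $0 here would blind both the per-run cap and the daily budget.
      throw new Error(`no price for paid model "${model}": refusing to run unmetered`);
    }
    return 0; // assumed: local or mock providers legitimately cost nothing
  }
  const p = prices[model];
  return (inputTokens * p.inputUsdPerMTok + outputTokens * p.outputUsdPerMTok) / 1e6;
}

// Placeholder table; not real prices for any real model.
const examplePrices: Record<string, ModelPrice> = {
  "example-model": { inputUsdPerMTok: 2, outputUsdPerMTok: 8 },
};

const pricedCost = meterRun(examplePrices, "example-model", true, 500000, 250000);
// pricedCost is 3: (500000 * 2 + 250000 * 8) / 1e6
```

Failing closed costs one run; failing open costs the guarantee that every other money rail depends on.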
Classical MAS took from philosophy the idea that messages are acts, not
data: an utterance informs, requests, proposes, commits. KQML (Finin et al., 1994)
built an agent language from such performatives, and FIPA-ACL standardized a library
of 22 communicative acts — among them inform (“The
sender informs the receiver that a given proposition is true”), request
(“The sender requests the receiver to perform some action”), propose,
and cfp, the call-for-proposals that powers contract nets. The
standards went dormant; the idea did not. The modern protocol layer rediscovers the
split at a different altitude: Anthropic’s MCP (Nov 2024) is “an open standard that
enables developers to build secure, two-way connections between their data sources
and AI-powered tools” — the agent↔tool half — while Google’s A2A (April 2025, since
donated to the Linux Foundation) covers the agent↔agent half, with capability-advertising
Agent Cards and a stateful task lifecycle. The cleanest one-line split is A2A’s own:
“A2A is about agents partnering on tasks, while MCP is more about agents using
capabilities.”
H0LD1NG is an honest case study in where performatives actually went — and in what
happens when an audit reads its own conclusions. As first published, its memos had
the ACL envelope — from, to (a role or all), subject, body,
readBy — but no performative field: whether a memo was a directive, a status report
or a question lived in prose (“Use memos for direction, status, hand-offs and
questions”). Hours after this article shipped, the field arrived: every memo now
carries kind: direct | inform | request | handoff | question, and the
performative has wake semantics — an inform memo wakes nobody
(it waits for the recipient’s next natural wake), the action kinds fire triggers as
before, and a trigger can filter on it (when: {kind: request}).
FIPA’s 22 acts became five, with scheduling consequences. The other half of the
observation stands: the strongest performatives live in the typed tool calls. A
board_update that sets status: in_progress together with
role: content-writer is not a message about work — it is the
bounce-back act itself, with system-stamped authorship (“always correct and never
trusted to the model”) and machine-readable consequences: the orchestrator’s
dispatcher treats exactly that field pair as the contract that wakes the other side.
Speech-act theory survives — compiled into JSON schemas.
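The wake semantics of the five memo kinds reduce to one routing function. A sketch under our own assumptions: the envelope fields and the `kind` values are the ones listed above, but the trigger shape, the function name and the exclusion of the sender are ours.

```typescript
type MemoKind = "direct" | "inform" | "request" | "handoff" | "question";

interface Memo {
  from: string;
  to: string; // a role id, or "all"
  subject: string;
  body: string;
  kind: MemoKind;
  readBy: string[];
}

// Hypothetical trigger shape mirroring `when: {kind: request}`.
interface MemoTrigger { role: string; when?: { kind?: MemoKind }; }

function rolesToWake(memo: Memo, triggers: MemoTrigger[]): string[] {
  // An inform waits for the recipient's next natural wake.
  if (memo.kind === "inform") return [];
  return triggers
    .filter((t) => memo.to === "all" || memo.to === t.role)             // addressed to this role
    .filter((t) => t.role !== memo.from)                                // assumed: sender never wakes itself
    .filter((t) => !t.when || !t.when.kind || t.when.kind === memo.kind) // optional kind filter
    .map((t) => t.role);
}

const memoTriggers: MemoTrigger[] = [
  { role: "pm", when: { kind: "request" } },
  { role: "engineer" },
  { role: "ceo" },
];
const baseMemo = { from: "ceo", to: "all", subject: "Q3 plan", body: "...", readBy: [] };

const wokenByInform = rolesToWake({ ...baseMemo, kind: "inform" }, memoTriggers);     // []
const wokenByRequest = rolesToWake({ ...baseMemo, kind: "request" }, memoTriggers);   // ["pm", "engineer"]
const wokenByQuestion = rolesToWake({ ...baseMemo, kind: "question" }, memoTriggers); // ["engineer"]
```

The performative is doing scheduling work here: the same envelope and body wake zero, one or two roles depending only on `kind`.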
The single most load-bearing result in MAS is a complexity theorem. Sequential decision-making under uncertainty has a ladder: a fully observed single-agent problem is an MDP (P-complete — easy); hide part of the state and it becomes a POMDP (PSPACE-complete — hard); now give the problem to a team of agents, each with its own partial, different observations, and you have a DEC-POMDP. Bernstein, Givan, Immerman & Zilberstein (2002) proved the finite-horizon case NEXP-complete — even for two agents. Their own gloss: the problems “provably do not admit polynomial-time algorithms,” a “fundamental difference between centralized and decentralized control,” and “mathematical evidence corresponding to the intuition that decentralized planning problems cannot easily be reduced to centralized problems and solved exactly using established techniques.”
Read that as systems guidance and it says: optimal decentralized coordination is computationally out of reach, so a real system should not chase it with cleverness — it should bound the damage with structure. That is precisely the shape of every mechanism in §§03–05. H0LD1NG’s agents act on a role-local projection of state (the prompt carries the board summary, your tasks, your unread memos — “Other roles cannot see your session”): partial observability by construction. The reservation-based budget guard exists because concurrent dispatch decisions against a stale shared total are exactly the kind of joint decision a DEC-POMDP punishes. And the loop guard exists because a maker/reviewer pair has no global view of their joint cycle — the theory says giving them one is NEXP-expensive; the engineering answer is a rate limiter.
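A back-of-envelope count shows why nobody enumerates joint plans. For a finite horizon, one agent's deterministic policy is a tree that picks an action after every possible observation history, and the team's joint policy is one such tree per agent. The sketch below counts them; it illustrates the size of the search space only, and is not the reduction Bernstein et al. use to prove hardness.

```typescript
// Nodes in one agent's horizon-T policy tree: 1 + |O| + |O|^2 + ... + |O|^(T-1).
function policyTreeNodes(observations: number, horizon: number): number {
  let nodes = 0;
  for (let t = 0; t < horizon; t++) nodes += Math.pow(observations, t);
  return nodes;
}

// log10 of the number of deterministic joint policies: (|A|^nodes)^agents.
// Returned as a logarithm because the count itself overflows a double quickly.
function log10JointPolicies(
  actions: number,
  observations: number,
  horizon: number,
  agents: number,
): number {
  const log10Actions = Math.log(actions) / Math.LN10;
  return agents * policyTreeNodes(observations, horizon) * log10Actions;
}

// Two agents, two actions, two observations each.
const nodesAtHorizon3 = policyTreeNodes(2, 3);             // 7 nodes per tree
const jointAtHorizon3 = log10JointPolicies(2, 2, 3, 2);    // log10(128^2) = log10(16384)
const jointAtHorizon10 = log10JointPolicies(2, 2, 10, 2);  // about 616, i.e. roughly 10^616 joint policies
```

Going from horizon 3 to horizon 10 takes the smallest possible two-agent problem from about sixteen thousand joint policies to roughly 10^616: the count is doubly exponential in the horizon, which is the intuition behind preferring guards to joint plans.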
Multi-agent reinforcement learning hits the same wall from a different side. With several learners adapting at once, “the Markov property … becomes invalid since the environment is no longer stationary” (Hernandez-Leal et al.’s survey) — each agent’s improvement is every other agent’s distribution shift, the “moving target” problem. What it took to beat it in practice is a measure of the difficulty: OpenAI Five defeated the Dota 2 world champions on April 13, 2019 after a single training run of ten months and 770 ± 50 PFlops/s·days of compute — then won 99.4% of 7,257 public games; AlphaStar reached Grandmaster, above 99.8% of ranked humans, by training an entire league of “continually adapting strategies and counter-strategies” rather than one agent. The flip side of emergence-as-problem is emergence-as-resource, and it is old: Reynolds’ 1987 boids produced flocking from three local rules (collision avoidance, velocity matching, flock centering — the “separation, alignment, cohesion” naming came later), “the result of the dense interaction of the relatively simple behaviors of the individual simulated birds.” H0LD1NG’s vitality index is a small wager on the same idea — a company-level health score no single agent computes or owns — and its adaptation engine, which reacts to moving KPIs, carries a 12-hour per-metric cooldown for exactly the moving-target reason: an org that re-aims at every twitch of its own measurements never converges.
The 2023 wave arrived as societies before it arrived as engineering. CAMEL framed two role-playing agents steered by “inception prompting” as a way to study “a society of agents.” Generative Agents put 25 personas in a Sims-style town with a memory stream, reflection and planning — and watched a single seeded intention (“one agent wants to throw a Valentine’s Day party”) diffuse, autonomously, from one agent (4%) to thirteen (52%) in two simulated days, while the social network densified from 0.167 to 0.74. Ablating the memory architecture didn’t dent believability, it demolished it: the full architecture versus the prior-work baseline measured a standardized effect size of d = 8.16. Then came the software companies: MetaGPT encoded “Standardized Operating Procedures (SOPs) into prompt sequences” and hit 85.9% / 87.7% on HumanEval / MBPP; ChatDev ran a waterfall of designing-coding-testing-documenting chats that shipped an app “in under seven minutes at a cost of less than one dollar” (v3 figures), with 86.66% of generated systems executing flawlessly. The frameworks followed — AutoGen (Aug 2023) made “conversable” agents a programming model; OpenAI shipped the experimental Swarm (Oct 2024), superseded by the Agents SDK (Mar 2025) with handoffs and guardrails as primitives; Claude Code grew subagents, each running “in its own context window with a custom system prompt, specific tool access, and independent permissions” — and, pointedly, unable to spawn subagents of their own. A depth limit. Smith would recognize it.
Du et al.’s multiagent debate (2023, ICML 2024) is the cleanest controlled positive: several copies of the same model answer independently, read each other’s answers, revise, repeat. With three agents and two rounds on ChatGPT, arithmetic went 67.0→81.8%, grade-school math 77.0→85.0%, biography factuality 66.0→73.8%, MMLU 63.9→71.1%, chess-move validity 29.3→45.2% — and, intriguingly, debate sometimes recovers the right answer when every agent starts wrong. “More Agents Is All You Need” (Li et al., Feb 2024) showed an even blunter lever: pure sampling-and-voting scales with ensemble size — gains of 12–24% on GSM8K and 6–10% on MATH, with a 15-sample Llama2-13B (59%) overtaking a single Llama2-70B (54%) — though returns taper at extreme task difficulty. And the headline production number: Anthropic’s research system (June 2025), an orchestrator-worker design, “outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval.”
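The blunter of the two levers, sampling-and-voting, needs no inter-agent communication at all: draw several independent answers and keep the most common one. A minimal sketch with hard-coded sample answers standing in for model outputs; the tie-break rule is our choice, not something the paper prescribes.

```typescript
// Plurality vote over independently sampled answers.
// Ties go to the answer that reached the winning count first (our assumption).
function majorityVote(answers: string[]): { answer: string; votes: number } | null {
  const counts: Record<string, number> = {};
  let best: { answer: string; votes: number } | null = null;
  for (let i = 0; i < answers.length; i++) {
    const a = answers[i];
    counts[a] = (counts[a] || 0) + 1;
    if (best === null || counts[a] > best.votes) {
      best = { answer: a, votes: counts[a] };
    }
  }
  return best;
}

// Five samples of the same question; two of them are wrong in different ways.
const sampledAnswers = ["42", "41", "42", "40", "42"];
const voted = majorityVote(sampledAnswers);
// voted is { answer: "42", votes: 3 }
```

Voting helps when errors are scattered and the correct answer is the most likely single output; it cannot rescue a question where the model's most common answer is itself wrong.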
The same writeup, read closely, is also the field’s best cost disclosure. Agents use ~4× the tokens of chat; “multi-agent systems use about 15× more tokens than chats.” On BrowseComp, “token usage by itself explains 80% of the variance” in performance (three factors together: 95%). Which invites the deflationary reading the skeptics formalized: maybe multi-agent systems work mainly because they are a socially acceptable way to spend more tokens.
Hold compute constant and much of the magic evaporates. Smit et al.’s “Should we be going MAD?” (ICML 2024) benchmarked debating systems against simpler strategies and found they “do not reliably outperform … self-consistency and ensembling” — on MMLU, self-consistency scored 0.78 against ≤0.74 for every debate variant tested — while remaining far more hyperparameter-sensitive. Tran & Kiela’s matched-compute study (arXiv:2604.02460, 2026) swept thinking budgets from 100 to 10,000 tokens across five multi-agent architectures and found the single-agent system “the best-performing system or statistically indistinguishable from the best for all budgets except the lowest one,” concluding that “many reported MAS gains are better explained by compute and context effects than by inherent architectural superiority.” Debate has a specific failure mechanism: under social pressure, “models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning” (“Talk Isn’t Always Cheap”, 2025). And practitioners said it plainest — Cognition’s “Don’t Build Multi-Agents” (June 2025) argues context must be shared in full (“Share context, and share full agent traces, not just individual messages”), that parallel subagents make implicitly conflicting decisions, and that “in 2025, running multiple agents in collaboration only results in fragile systems.” Even Anthropic’s own postmortem lists early agents “spawning 50 subagents for simple queries.”
The largest failure study is Berkeley’s MAST (“Why Do Multi-Agent LLM Systems Fail?”, 2025): 1,642 annotated execution traces across 7 frameworks, a taxonomy of 14 failure modes in 3 categories — system design issues, inter-agent misalignment, task verification — built from expert analysis with κ = 0.88 inter-annotator agreement. The systems studied failed at rates from 41% to 86.7%. The headline conclusion is the one this whole article has been circling: “Consistent with organization theories, our findings indicate that many MAS failures arise from the challenges in organizational design and agent coordination rather than the limitations of individual agents.” Interventions back it up — better role specifications alone bought ChatDev +9.4% task success; adding a verification step, +15.6%. The fix for multi-agent systems is org design, which is to say: §§03–06 of this article. Topology research lands the same way from the efficiency side: AgentPrune showed that one-shot pruning of the inter-agent message graph cuts 28.1–72.8% of tokens while matching state-of-the-art topologies ($5.6 against their $43.7); GPTSwarm treats the whole society as an optimizable graph. Most of what agents say to each other, measured, is padding.
Synthesis, honestly labeled as ours: multi-agent helps where the task is wide — parallel branches, more total context than one window holds, many tools — and the coordination surface is thin. It hurts where the task is deep and every decision depends on every other. The theory predicted both halves: parallelizable subproblems are the case where the DEC-POMDP’s joint-planning cost doesn’t bind; tightly coupled ones are where it bites hardest, and where Cognition’s single-threaded-agent advice is just the theorem worn as engineering taste.
The running claim of this article, in one table. Each row pairs a classical result with the thing your agent stack calls it and the line of H0LD1NG that instantiates it. The last column is the part most syntheses skip: whether the mapping is published in the literature or is our own analogy. Both kinds are useful; only one kind is citable.
| classical concept | 2026 vernacular | h0ld1ng instance | mapping |
|---|---|---|---|
| Contract Net Protocol (Smith 1980) | orchestrator-worker dispatch; outsourcing | cross-company broker: announce + award + report, one-hop limit; post-publication, the bid round too: omit the target and the broker auctions by vitality/queue/cost (orchestrator.ts) | ✓ published (arXiv:2504.21030, 2506.01900) |
| Blackboard system (Hearsay-II, 1980) | shared task board, scratchpad, plan file | the board as sole coordination channel; tickets as hypotheses; review as credibility gate (companyStore.ts) | ✓ published (arXiv:2507.01701, 2510.01285) |
| Joint intentions / SharedPlans (Cohen & Levesque ’90; Grosz & Kraus ’93) | task claiming, hand-offs, ownership | claimedBy = individual commitment; computeClaim bounce/release rules; crash → claims released | ◆ our synthesis (partial: no mutual-belief protocol) |
| Social laws (Shoham & Tennenholtz ’92/’95) | guardrails, rate limits, loop breakers | self-trigger guard; loop guard (≤6 per subject / 10 min); both engine-enforced, model-independent; chronic loops now escalate (6h mute + an executive memo) | ◆ our synthesis (tight: offline-designed action constraints) |
| Mechanism design / VCG (Vickrey ’61; Clarke ’71; Groves ’73) | budget caps, spend governance | reservation-based daily cap; $25 defaulted backstop; unpriced-model refusal; fsync’d ledger | ◆ our synthesis (rules carry the burden, not agents) |
| Speech-act performatives (KQML ’94; FIPA-ACL, 22 acts) | typed tool calls, structured outputs | board_update {status, role} as the bounce-back act, “the trigger contract roles code to”; memos now carry kind (informs wake nobody) | ◆ our synthesis |
| DEC-POMDP hardness (Bernstein et al. 2002) | “why did adding agents make it worse” | role-local prompts (partial views); guards instead of joint plans; the loop guard as NEXP-avoidance | ◆ synthesis, standard framing (hardness is the published part) |
| MARL non-stationarity (Hernandez-Leal et al. survey) | agents adapting to each other mid-flight | pessimistic reservation accounting; adaptation engine’s 12h cooldown as anti-windup | ◆ our synthesis |
| Organization theory of failure (MAST, 2025) | role specs, verification steps | archetypes + acceptance criteria + a reviewer role wired to status: review (the two interventions MAST measured, as YAML); reviewer-less closes of criteria-carrying tickets are now mechanically refused without an on-record comment | ✓ published (MAST’s own conclusion) |
A field guide that ends in advice should be willing to take it. Hours after this
article first shipped, the six strongest mappings above went back into the codebase
as mechanisms (commit cbb0887), each traceable to a section here —
implemented, adversarially reviewed, and verified against a live mock holding:
- Memos now carry kind: direct | inform | request | handoff | question; informs wake nobody, and triggers can filter on kind. FIPA’s lesson, compiled. → §06
- A pair that keeps tripping the loop guard is muted for six hours, loop.escalated fires, and the executive gets a request memo: the silent damper became an organizational signal. → §04
The CEO↔PM ping-pong of FIG 06 — this article’s favorite specimen — is structurally
extinct: the status memo that fed it is typed inform now, and the next
holding dataset will not reproduce it. The chart documents a failure mode its own
caption helped fix.
The strongest version of this article’s claim is not “everything was known in 1995.” The LLM changed the economics of deliberation so completely that the field’s center of gravity moved: classical MAS spent decades on how agents decide, and the decision core is now a commodity you rent by the token. What did not change is everything around the core — task allocation, shared state, conventions, incentives, the hardness of joint planning — because none of that ever depended on how the agent thinks. That is why a task board built from engineering instinct in 2026 lands within arm’s reach of Hearsay-II, and why MAST’s 1,642 failure traces read like an organization-theory syllabus.
So the practical advice is almost embarrassingly cheap: the literature is a free design review. If your orchestrator re-dispatches work, Smith 1980 already wrote the protocol and its termination conditions. If your agents share a plan file, Hearsay-II already worked out activation. If two of your agents can wake each other, Shoham & Tennenholtz already argued the guard belongs in the rules, not the agents. And if adding a sixth agent made everything slower and worse — that isn’t a bug in your framework. It’s a theorem, and it has been NEXP-complete since 2002.
Everything in our dataset is replayable: four mock companies plus one very busy
software house, 826 runs, $23.31 of simulated spend, every event in append-only
files. The reset ritual is unchanged from the last article — stop the daemon,
rm -rf workspaces/<id>, and forty-five years of theory boots
again in about four seconds.
$ node packages/cli/dist/index.js start software-house --mock
$ open http://localhost:4733   # watch a 1980 blackboard fill with 2026 tickets
Every figure above was checked against its primary source by an independent verification pass; quotes are verbatim. Where a number lives only in a table (not a prose sentence), or a date rests on secondary reporting, the text says so.