Multi-Agentic Software Development Is a Distributed Systems Problem
TL;DR Highlight
Having multiple LLM agents collaborate on software is fundamentally a distributed consensus problem, and the limits that come with it do not disappear as models become more capable.
Who Should Read
Developers designing or operating multi-agent pipelines, or AI engineers concerned about the stability and consistency of LLM-based automation systems.
Core Mechanics
- The author directly challenges the prevalent industry view that 'agent coordination problems will be solved as models improve,' arguing that impossibility results from distributed systems theory already apply, independent of model capability.
- Natural-language prompts are inherently underspecified: many different programs can be consistent with a single prompt P, and the LLM effectively 'selects' one of them.
- In multi-agent development, where agents A1, ..., An each implement a different component φ1, ..., φn, the requirement that the combined result reflect a single consistent interpretation of P is, the author argues, mathematically equivalent to the distributed consensus problem.
- A design decision by one agent constrains the choices of other agents. For example, if the network agent chooses a callback-based asynchronous API library, the integration agent must configure the infrastructure accordingly.
- The author argues that the FLP impossibility result (no deterministic algorithm can guarantee consensus in an asynchronous distributed system if even one participant may crash) also applies to this problem. Commenters, however, countered that LLM agents are probabilistic, so FLP may not apply directly.
- The author emphasizes external verification (tests, compilation, linting, etc.) as the key mechanism for turning Byzantine faults (failures in which a participant sends incorrect information) into crash faults (failures in which a participant simply stops). Without tests, there is no way even to detect that an agent has adopted an incorrect interpretation.
- The author stated that they are researching a new formal language that combines choreographic languages (formal languages that describe the interactions of distributed participants from a holistic perspective) and game theory to address this problem.
- Partial synchrony (a distributed system model in which message delay has an upper bound, but one that is unknown or holds only eventually) is mentioned as a realistic way around FLP, and it is suggested that iterative improvement loops can implement it in practice.
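The consensus framing can be made concrete with a toy calculation. It rests on an assumption that is not in the original: the prompt admits k equally plausible interpretations of some shared decision, and each of n agents picks one independently and uniformly at random. Under that assumption, the probability that all agents land on the same interpretation is k · (1/k)^n = k^(1 - n). The function names below are hypothetical; the brute-force version just cross-checks the closed form.

```python
from fractions import Fraction
from itertools import product


def agreement_probability(k: int, n: int) -> Fraction:
    """Exact P(all n agents choose the same interpretation).

    Toy model (an assumption, not from the article): each of n agents
    independently and uniformly picks one of k interpretations.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    # k ways for everyone to agree, each occurring with probability (1/k)**n.
    return k * Fraction(1, k) ** n


def agreement_probability_brute_force(k: int, n: int) -> Fraction:
    """Same quantity by enumerating all k**n joint choices (small k and n only)."""
    outcomes = list(product(range(k), repeat=n))
    agreeing = sum(1 for choice in outcomes if len(set(choice)) == 1)
    return Fraction(agreeing, len(outcomes))


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        p = agreement_probability(k=3, n=n)
        print(f"k=3, n={n}: P(all agree) = {p} (about {float(p):.3f})")
```

With k = 3, agreement drops from 1 for a single agent to 1/9 for three agents and 1/81 for five, which is one way to see why the How to Apply section recommends fixing shared decisions before agents work in parallel. The independence assumption is a simplification; the Evidence section argues that real agents are correlated, which raises agreement but lets them agree on the same wrong interpretation.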
Evidence
- A developer who has run a multi-agent pipeline in practice reported reaching the same conclusion after adopting sequential plan→design→code stages with deterministic verification gates (compilation, linting) at each stage. Their practical framing: deterministic gates provide a lower bound on assurance, while agent reviewers provide a probabilistic upper bound.
- One commenter raised a technical objection to applying FLP: FLP concerns deterministic consensus, whereas LLM agents sample from probability distributions and are therefore inherently randomized. Just as Ben-Or's (1983) randomized consensus algorithm sidesteps FLP with a 'flip a coin when stuck' strategy, terminating with probability 1 rather than deterministically, agent systems should be analyzed within a randomized consensus framework.
- Another commenter pointed out that the independent-failure assumption behind Byzantine fault tolerance (faulty participants fail independently of one another) does not hold, because LLM agents share the same weights and training data. When a prompt is ambiguous, agents do not err in different directions; they are biased in the same direction, which is more dangerous because majority voting cannot catch it.
- A practical connection was also drawn: the bounded timeouts of workflow engines such as Temporal map to the message-delay upper bound of the DLS (Dwork-Lynch-Stockmeyer) partial synchrony model. It was noted, however, that even when infrastructure-level retries succeed, 'semantic idempotency' remains unsolved: calling an LLM again can produce a different output.
- Others countered that the same mathematical results apply equally to teams of humans, who have nonetheless built large codebases such as Linux; in this view, the math does not prove what AI cannot do. Perspectives emphasizing the architect's role, such as Conway's Law (a system's structure mirrors the communication structure of the organization that builds it), were also raised.
- A developer who has operated a real team of 3-4 agents shared that using one agent as a supervisor to handle PR reviews and conflict resolution worked well at that scale. However, they also added that the supervisor became a single coordination bottleneck, like a human tech lead.
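The point about correlated failures can be quantified with a small calculation. Assume, purely for illustration, that each of five reviewer agents misreads an ambiguous prompt with probability 0.3. If errors were independent, a majority vote is wrong only when three or more agents err at once; if the agents share weights and therefore fail together, the vote is wrong whenever the shared interpretation is wrong. The numbers and function names are hypothetical, not measurements from the discussion.

```python
from math import comb


def p_majority_wrong_independent(n: int, p: float) -> float:
    """P(a strict majority of n voters is wrong) when each voter is wrong
    independently with probability p."""
    majority = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(majority, n + 1))


def p_majority_wrong_correlated(n: int, p: float) -> float:
    """Perfectly correlated voters (same weights, same bias): all are wrong
    together or none are, so the vote is wrong with probability p for any n."""
    return p


if __name__ == "__main__":
    n, p = 5, 0.3  # hypothetical numbers chosen for illustration
    print(f"independent errors: {p_majority_wrong_independent(n, p):.4f}")
    print(f"correlated errors:  {p_majority_wrong_correlated(n, p):.4f}")
```

With these numbers, majority voting cuts the error rate from 0.30 to about 0.16 when errors are independent, but leaves it at 0.30 when they are perfectly correlated. Real agents sit somewhere in between, which is why voting alone is a weak substitute for deterministic gates.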
How to Apply
- When designing multi-agent pipelines, be sure to place deterministic verification gates such as compilation, linting, and type checking at each boundary between agents. This will transform an agent's incorrect interpretation into a detectable failure before it propagates to the next agent, downgrading Byzantine faults to simpler crash faults.
- If agents work in parallel, first have one agent settle the shared design decisions (API style, data types, library choices, etc.) and record them as explicit artifacts (spec documents, interface definitions) before handing off to the other agents. This makes the 'shared state' of the consensus problem explicit and reduces semantic drift between agents.
- If you use a workflow engine such as Temporal, setting explicit activity timeouts corresponds to the message-delay upper bound of the partial synchrony model. Real consistency, however, holds only if outputs still pass the verification gates after every retry, since a retried LLM call may return different output.
- As the number of agents grows, a single supervisor agent will struggle to hold the entire context. At that point, introduce a hierarchical supervision structure (e.g., one supervisor per subteam plus a top-level coordinator) to distribute the bottleneck, and place verification gates at each level boundary.
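The gate-plus-retry pattern from the list above can be sketched in a few lines of Python. Everything here is hypothetical: `SharedSpec` stands in for the shared design artifact, `canned_agent` stands in for an LLM call (it returns a different canned output on each call, so a retry can change the result), and the deterministic gate uses Python's built-in `compile()` as a stand-in for a real compiler, linter, or type checker. The point is that a malformed or off-spec output becomes an explicit, detected failure (a crash fault) instead of flowing silently to the next agent, and that every retry is re-checked.

```python
from collections.abc import Callable, Iterator
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedSpec:
    """Hypothetical shared-decision artifact, fixed before agents run in parallel."""
    api_style: str  # e.g. "callback" or "async_await"


class GateFailure(Exception):
    """Raised when no attempt passes the gate: a detected (crash-style) failure."""


def deterministic_gate(source: str, spec: SharedSpec) -> None:
    """Stand-in for compile/lint/type-check; raises if the output is unusable."""
    compile(source, "<agent-output>", "exec")  # raises SyntaxError on bad code
    # Toy conformance check: the output must declare the agreed API style.
    if f"API_STYLE = {spec.api_style!r}" not in source:
        raise ValueError(f"does not follow api_style={spec.api_style!r}")


def run_stage(agent: Callable[[], str], spec: SharedSpec, max_attempts: int = 3) -> str:
    """Call the agent, gating every attempt, because a retry may return new output."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        output = agent()
        try:
            deterministic_gate(output, spec)
        except (SyntaxError, ValueError) as exc:
            errors.append(f"attempt {attempt}: {exc}")
            continue
        return output  # only gate-approved output reaches the next stage
    raise GateFailure("; ".join(errors))


def canned_agent(outputs: list[str]) -> Callable[[], str]:
    """Hypothetical agent stub: returns the next canned output on each call."""
    it: Iterator[str] = iter(outputs)
    return lambda: next(it)


if __name__ == "__main__":
    spec = SharedSpec(api_style="callback")
    flaky = canned_agent([
        "def handler(:\n    pass\n",                        # syntax error
        "API_STYLE = 'async_await'\n",                     # wrong interpretation
        "API_STYLE = 'callback'\ndef on_data(cb): cb()\n",  # passes both checks
    ])
    print(run_stage(flaky, spec))
```

Running it prints the third output: the first attempt is rejected by the syntax check and the second by the conformance check, both as explicit errors rather than silent drift. A real pipeline would swap in the project's actual build and lint tools and keep the shared spec as a reviewed file.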
Terminology
distributed consensus: A problem where multiple independent participants must agree on a single common value. Think of multiple servers connected only by the internet agreeing on whether 'this transaction is valid'.
FLP impossibility: A theorem proved by Fischer, Lynch, and Paterson in 1985, stating that no deterministic algorithm can guarantee consensus in an asynchronous distributed system if even one participant may fail by crashing. It is a fundamental limitation in environments where message delays are unbounded.
Byzantine fault: A failure type in distributed systems where a participant does not merely stop but behaves arbitrarily, including sending incorrect or malicious messages. An agent quietly generating incorrect code corresponds to this.
choreographic language: A formal language that describes the interactions of a distributed system from a global viewpoint rather than from each participant's local perspective. Like a dance choreography, it expresses 'who sends what to whom, and when' as one complete scenario.
partial synchrony: A distributed system model that assumes message delivery has an upper bound that is unknown to the participants (or that holds only after some unknown point in time). It is more realistic than full asynchrony, the setting of FLP, and consensus is solvable under this assumption.
semantic idempotency: The property that executing the same operation multiple times yields a result semantically equivalent to executing it once. LLMs do not guarantee this, because re-running the same prompt can produce a different output, which conflicts with the retry logic of distributed systems.
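Because the partial-synchrony bound is unknown, a common technique in partially synchronous protocols is to lengthen the timeout every round until it exceeds the bound; PBFT's view-change timer, which doubles after each failed view change, is a well-known example. The sketch below is a simulation, not a real network: `run_with_doubling_timeout` and `simulated_network` are hypothetical helpers, and the hidden delay stands in for the bound that the protocol never learns directly.

```python
from collections.abc import Callable


def run_with_doubling_timeout(
    responds_within: Callable[[float], bool],
    initial_timeout: float = 1.0,
    max_rounds: int = 32,
) -> tuple[int, float]:
    """Retry with a timeout that doubles each round.

    Under partial synchrony the delay bound exists but is unknown, so the
    protocol never reads it; it only observes whether a round's timer fired.
    Returns (round that succeeded, timeout used in that round).
    """
    timeout = initial_timeout
    for round_no in range(1, max_rounds + 1):
        if responds_within(timeout):
            return round_no, timeout
        timeout *= 2  # timer fired first: assume the bound is larger, try longer
    raise TimeoutError(f"no response within {max_rounds} rounds")


def simulated_network(true_delay: float) -> Callable[[float], bool]:
    """Hypothetical environment whose real delay is hidden from the protocol."""
    return lambda timeout: true_delay <= timeout


if __name__ == "__main__":
    print(run_with_doubling_timeout(simulated_network(true_delay=5.0)))
```

With a hidden delay of 5.0 and an initial timeout of 1.0, the rounds with timeouts 1, 2, and 4 fail and the fourth round, with timeout 8, succeeds. A fixed timeout shorter than the real delay would fail forever; growing it is what lets the protocol eventually exceed a bound it never observes.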