Multi-Agentic Software Development Is a Distributed Systems Problem
TL;DR Highlight
The problem of multiple LLM agents collaborating to create software is fundamentally a distributed consensus problem, and this inherent limitation does not disappear as models become more intelligent.
Who Should Read
Developers designing or operating multi-agent pipelines, or AI engineers concerned about the stability and consistency of LLM-based automation systems.
Core Mechanics
- The author directly refutes the prevalent industry view that 'agent coordination problems will be solved as models improve.' The argument rests on impossibility results from distributed systems theory, which hold regardless of model capability.
- Natural language prompts are inherently underspecified. That is, multiple consistent programs can exist for a single prompt P, and the LLM 'selects' one of them.
- In multi-agent development, agents A1 through An each implement a different component φ1 through φn. Requiring that the combined result reflect a single consistent interpretation of the prompt is mathematically equivalent to the distributed consensus problem.
- A design decision by one agent constrains the choices of other agents. For example, if the network agent chooses a callback-based asynchronous API library, the integration agent must configure the infrastructure accordingly.
- The author argues that the FLP impossibility result (no deterministic protocol can guarantee consensus in an asynchronous distributed system if even one participant may crash) also applies to this problem. However, comments raise the counterargument that LLM agents are probabilistic entities, so FLP may not apply directly.
- The author emphasizes external verification (tests, compilation, linting, etc.) as a key mechanism for transforming Byzantine faults (errors where participants send incorrect information) into crash faults (errors where participants simply stop). Without tests, it is impossible to even detect if an agent has made an incorrect interpretation.
- The author stated that they are researching a new formal language that combines choreographic languages (formal languages that describe the interactions of distributed participants from a holistic perspective) and game theory to address this problem.
- Partial synchrony (a distributed system model in which message delay has an upper bound, though the bound may be unknown or may hold only after some point in time) is mentioned as a realistic escape from FLP, and is suggested to be implementable through iterative improvement loops.
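The idea of using external verification to turn a silent misinterpretation into a visible failure can be sketched as follows. This is a minimal illustration, not the author's implementation: the names `GateResult`, `syntax_gate`, `behaviour_gate`, and `run_gates`, and the toy contract "the output must define `add(a, b)` that returns the sum", are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GateResult:
    ok: bool
    reason: str = ""


def syntax_gate(source: str) -> GateResult:
    """Deterministic check: does the agent's output compile at all?"""
    try:
        compile(source, "<agent-output>", "exec")
    except SyntaxError as exc:
        return GateResult(False, f"syntax error: {exc.msg}")
    return GateResult(True)


def behaviour_gate(source: str) -> GateResult:
    """Deterministic check against a hypothetical shared contract."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumes sandboxed or trusted input
        add = namespace.get("add")
        if not callable(add):
            return GateResult(False, "missing function add")
        if add(2, 3) != 5:
            return GateResult(False, "add(2, 3) != 5")
    except Exception as exc:
        return GateResult(False, f"raised {type(exc).__name__}")
    return GateResult(True)


def run_gates(source: str) -> Tuple[Optional[str], str]:
    """Return (source, "") if every gate passes, else (None, reason).

    A wrong output never reaches the next agent: it becomes an
    explicit, detectable stop (crash-like) instead of a silent
    wrong value (Byzantine-like).
    """
    for gate in (syntax_gate, behaviour_gate):
        result = gate(source)
        if not result.ok:
            return None, result.reason
    return source, ""
```

For instance, `run_gates("def add(a, b):\n    return a - b\n")` returns `None` together with a reason string, so the downstream agent sees a failed stage rather than a plausible-looking but wrong component.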
Evidence
- A developer who has operated a multi-agent pipeline in practice shared that they reached the same conclusion: they adopted sequential plan→design→code stages with deterministic verification gates such as compilation and linting at each stage. As a practical framework, they suggested that deterministic gates provide a lower bound on assurance, while agent reviewers add a probabilistic upper bound on top of it.
- There was a technical counterargument to applying the FLP impossibility result. FLP concerns deterministic consensus, while LLM agents sample from probability distributions and are therefore inherently probabilistic. Ben-Or's (1983) randomized consensus algorithm sidesteps FLP with a 'flip a coin if stuck' strategy, and the commenter argued that agent systems should likewise be viewed within a randomized consensus framework.
- It was pointed out that the independence assumption behind Byzantine fault tolerance (participants fail independently of one another) does not hold, because LLM agents share the same weights and training data. When prompts are ambiguous, agents do not err in different directions but are biased in the same direction, which is more dangerous because majority voting cannot catch it.
- A practical connection was drawn: the bounded timeout of workflow engines like Temporal maps to the message delay upper bound of the DLS (Dwork-Lynch-Stockmeyer) partial synchrony model. However, it was also pointed out that even if infrastructure-level retries succeed, the 'semantic idempotency' problem remains unsolved, because calling the LLM again can produce a different output.
- A counterargument was made that mathematical results apply equally to human agents, and that large codebases like Linux were created by humans. That is, mathematics does not prove what AI cannot do, and perspectives like Conway's Law, where the architect's role is key, were also presented.
- A developer who has operated a real team of 3-4 agents shared that using one agent as a supervisor to handle PR reviews and conflict resolution worked well at that scale. However, they also added that the supervisor became a single coordination bottleneck, like a human tech lead.
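The point about shared bias defeating majority voting can be made concrete with a small calculation. The sketch below is a toy model, not something taken from the discussion: it assumes each of `n` agents misreads an ambiguous prompt with probability `p`, and that with probability `rho` (a hypothetical correlation parameter) all agents share a single draw because of common weights and training data.

```python
from math import comb


def majority_wrong_independent(n: int, p: float) -> float:
    """P(more than half of n agents are wrong) with independent errors."""
    need = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1)
    )


def majority_wrong_correlated(n: int, p: float, rho: float) -> float:
    """Toy mixture model (an assumption of this sketch).

    With probability rho, every agent makes the same draw, so the
    majority is wrong with probability p. Otherwise the agents err
    independently.
    """
    return rho * p + (1 - rho) * majority_wrong_independent(n, p)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(rho, round(majority_wrong_correlated(5, 0.2, rho), 5))
```

With five agents and `p = 0.2`, independent errors put the chance of a wrong majority at about 0.058, while fully correlated errors (`rho = 1.0`) leave it at 0.2, the same as a single agent. Under this model, voting only helps to the extent that errors are independent.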
How to Apply
- When designing multi-agent pipelines, be sure to place deterministic verification gates such as compilation, linting, and type checking at each boundary between agents. This will transform an agent's incorrect interpretation into a detectable failure before it propagates to the next agent, downgrading Byzantine faults to simpler crash faults.
- If agents work in parallel, first have one agent finalize the shared design decisions (API style, data types, library choices, etc.) and record them as explicit artifacts (spec documents, interface definitions) before passing them on to the other agents. This is equivalent to clarifying 'shared state' in the distributed consensus problem and reduces semantic drift between agents.
- If you are using a workflow engine like Temporal, explicitly setting activity timeouts gives you the message delay upper bound of the partial synchrony model. However, because LLM output may change on a retry, consistency holds only if the retried output also passes the verification gates.
- As the number of agents increases, a single supervisor agent will struggle to handle the entire context. In this case, introduce a hierarchical supervision structure (e.g., supervisor per subteam + top-level coordinator), but design with verification gates at each level boundary to distribute the bottleneck.
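The combination of a bounded timeout, retries, and a gate that is re-run on every attempt can be sketched in plain Python. This is not Temporal's API; `call_with_timeout`, `run_stage`, `json_gate`, and `flaky_agent` are hypothetical names, and the shared decision being checked (`"api_style": "callback"`) is an assumed example of a documented design artifact.

```python
import concurrent.futures
import json
from functools import partial
from typing import Callable, Optional


def call_with_timeout(fn: Callable[[], str], timeout_s: float) -> Optional[str]:
    """Run fn in a worker thread; return None if it exceeds the delay bound."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # treated like a crash: no output for this attempt
    finally:
        pool.shutdown(wait=False)


def json_gate(output: str) -> bool:
    """Deterministic gate: valid JSON that honours the shared decision."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and data.get("api_style") == "callback"


def run_stage(
    agent: Callable[[int], str],
    gate: Callable[[str], bool],
    max_attempts: int = 3,
    timeout_s: float = 1.0,
) -> Optional[str]:
    """Retry until an output passes the gate or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        output = call_with_timeout(partial(agent, attempt), timeout_s)
        if output is None:
            continue  # timed out, try again
        if gate(output):
            return output  # only gated outputs leave the stage
    return None  # explicit failure instead of an unchecked result


def flaky_agent(attempt: int) -> str:
    """Stand-in for an LLM call: each retry may give a different answer."""
    outputs = {
        1: "api_style: promise",           # not even JSON
        2: '{"api_style": "promise"}',     # valid JSON, wrong decision
        3: '{"api_style": "callback"}',    # consistent with the spec
    }
    return outputs[attempt]
```

Here `run_stage(flaky_agent, json_gate)` returns the third output, while `run_stage(flaky_agent, json_gate, max_attempts=2)` returns `None`. The retry succeeding at the infrastructure level is not what makes the result acceptable; passing the gate is.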
Related Papers
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
An optimization framework that, when multiple AI agents collaborate, pinpoints exactly 'which agent in which round' failed and revises only that agent's prompt.
Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction
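The FuzzingBrain V2 paper, which describes an LLM-based multi-agent system that automatically finds and reproduces security vulnerabilities in C/C++ code; it detected 36 of 40 vulnerabilities (90%) in the AIxCC 2025 competition.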
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
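A non-parametric algorithm that compares successful and failed reasoning traces to extract short natural-language insights, improving model reasoning performance faster than GRPO with as few as 5 training samples.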
Claude Code as a Daily Driver: Claude.md, Skills, Subagents, Plugins, and MCPs
A practical guide to Claude.md configuration, subagents, plugins, and MCP integration for getting the most out of Claude Code as a terminal AI coding tool.
FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
An inline safety framework that blocks risky tool calls by finance AI agents mid-execution while maintaining the approval rate for legitimate calls.
Retrying vs Resampling in AI Control
A study finding that blocking suspicious actions and then retrying, as Claude Code does, can actually be more dangerous because it gives attackers hints.