CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation
TL;DR Highlight
A multi-agent framework that co-evolves plans and code, achieving 11-20% higher accuracy while using 4-10 fewer API calls per execution than existing methods.
Who Should Read
AI engineers designing or improving LLM-based code-generation pipelines, especially developers looking to strengthen debugging agents on complex programming problems.
Core Mechanics
- The core failure of existing multi-agent code-generation systems is that they keep fixing the code even when the plan is wrong. CollabCoder introduces a Collaborative Decision-Making (CDM) module that dynamically decides at each iteration whether to update the plan or the code.
- CDM runs three analyses in parallel (plan analysis, code analysis, and plan-code consistency analysis) and makes a consensus decision weighted by each analysis's confidence (wπ=0.4, wc=0.3, walign=0.3).
- The Reasoning Trajectory (RT) module accumulates past debugging history to guide the next correction. Unlike existing methods that debug from scratch every round, it remembers failure patterns to avoid repeating the same mistakes.
- Code-specialized models (Seed-Coder-8B, Qwen2.5-Coder-32B) choose code-level fixes 2-3 times more often, while general-purpose models (GPT-4o mini) choose plan-level fixes far more frequently; the debugging strategy adapts automatically to the model's characteristics.
- CollabCoder's benefits are most pronounced on hard, competition-level problems. The gap is small on easy tiers, but in the hard tier (rating 1600-1800) it solves 7 problems versus MapCoder's 3 and CodeSIM's 5.
- The same trend holds for the latest frontier models such as GPT-5.2 and Qwen3-Coder-Next (80B): the accuracy gap narrows, but CollabCoder consistently comes out ahead on API calls and token usage.
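The weighted consensus described above can be sketched as a simple score aggregation. The function name and the vote encoding below are illustrative assumptions, not the paper's implementation; only the weights (wπ=0.4, wc=0.3, walign=0.3) come from the source.

```python
# Sketch of CDM's weighted consensus (assumed encoding, not the paper's
# exact implementation). Each analysis casts a vote for "plan" or "code";
# votes are combined using the paper's confidence weights.

WEIGHTS = {"plan_analysis": 0.4, "code_analysis": 0.3, "alignment_analysis": 0.3}

def cdm_decision(votes: dict[str, str]) -> str:
    """votes maps analysis name -> 'plan' or 'code'; returns the chosen action."""
    score = {"plan": 0.0, "code": 0.0}
    for analysis, vote in votes.items():
        score[vote] += WEIGHTS[analysis]
    return "update_plan" if score["plan"] > score["code"] else "update_code"

# Example: plan analysis blames the plan, the other two blame the code,
# so code-level revision wins 0.6 to 0.4.
action = cdm_decision({
    "plan_analysis": "plan",
    "code_analysis": "code",
    "alignment_analysis": "code",
})
```

Because the plan analysis carries the largest single weight, it only takes one more analysis agreeing with it (0.4 + 0.3 = 0.7) to flip the decision toward a plan update.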
Evidence
- Achieved 6.6-7.1%p higher Pass@1 than MapCoder and 4.7-5.3%p higher than CodeSIM on LiveCodeBench and xCodeEval, based on GPT-4o mini, while reducing token consumption by 57% compared to MapCoder and 42% compared to CodeSIM.
- On LiveCodeBench, with an inference budget of 10 API calls, CollabCoder achieved 33.93% vs MapCoder 30.36% vs CodeSIM 31.25%. At budget t=5, CollabCoder solved 44/90 problems, while Reflexion stagnated at 37/90 and Best-of-N at 33/90.
- On basic benchmarks (HumanEval, MBPP), with Qwen2.5-Coder-32B as the base, CollabCoder averaged 82.50% vs CodeSIM 80.22% vs MapCoder 79.84%, using 4.12 API calls, less than half of MapCoder's 9.05.
- Removing CDM lowered the average accuracy of Seed-Coder-8B by 4.24%p, and removing RT lowered it by 3.36%p. Both modules contribute independently to performance and perform best when used together.
How to Apply
- If your existing debugging loop repeatedly modifies only the code, add a separate LLM call at each iteration that determines whether the problem lies in the plan or in the implementation. You can approximate CollabCoder's CDM by requesting the three perspectives (plan analysis, code analysis, and consistency analysis) in prompts and deciding by majority vote.
- Add a history memory to your debugging agent. Summarizing 'what modifications were attempted and why they failed' at each iteration as text (Reasoning Trajectory) and including it in the next prompt can reduce the rate of repeating the same mistakes.
- When using code-specialized models (e.g., Qwen2.5-Coder), set a higher weight for plan updates. According to the paper, these models tend to fix only the code even if the plan is wrong, so intentionally increasing wπ (e.g., to 0.5 or higher) can better induce plan-level modifications.
Code Example
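The minimal loop below shows how the two pieces recommended above, a CDM-style plan-vs-code vote and a Reasoning Trajectory memory, fit together. This is a sketch under assumptions: `call_llm` is a placeholder for your own LLM client, and all prompt wording, field names, and the trajectory truncation are illustrative, not from the paper.

```python
# Minimal sketch of a CollabCoder-style debug loop: a CDM step decides
# whether to revise the plan or the code, and a Reasoning Trajectory (RT)
# log of past attempts is included in every prompt. `call_llm` is a stub;
# prompts and field names are illustrative assumptions.

from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client (swap in your provider's API)."""
    raise NotImplementedError

@dataclass
class DebugState:
    plan: str
    code: str
    trajectory: list[str] = field(default_factory=list)  # RT memory

    def rt_summary(self) -> str:
        return "\n".join(self.trajectory[-5:])  # keep recent history short

def cdm_vote(state: DebugState, error: str, ask=call_llm) -> str:
    """Ask for three analyses and decide plan-vs-code by majority vote."""
    perspectives = ["plan analysis", "code analysis",
                    "plan-code consistency analysis"]
    votes = []
    for p in perspectives:
        answer = ask(
            f"Perform a {p}.\nPlan:\n{state.plan}\nCode:\n{state.code}\n"
            f"Test error:\n{error}\nPast attempts:\n{state.rt_summary()}\n"
            "Answer with exactly one word: PLAN or CODE."
        )
        votes.append("plan" if "PLAN" in answer.upper() else "code")
    return "plan" if votes.count("plan") >= 2 else "code"

def debug_step(state: DebugState, error: str, ask=call_llm) -> DebugState:
    target = cdm_vote(state, error, ask)
    fixed = ask(
        f"Revise the {target} to fix this error:\n{error}\n"
        f"Plan:\n{state.plan}\nCode:\n{state.code}\n"
        f"Past attempts:\n{state.rt_summary()}"
    )
    # Record what was tried and why, so later prompts avoid repeats (RT).
    state.trajectory.append(f"revised {target}; error was: {error[:120]}")
    if target == "plan":
        state.plan = fixed
    else:
        state.code = fixed
    return state
```

To bias a code-specialized model toward plan-level fixes, as suggested above, you could replace the majority vote with the weighted scheme from Core Mechanics and raise the plan-analysis weight.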
Terminology
Related Resources
Original Abstract
Automated code generation remains a persistent challenge in software engineering, as conventional multi-agent frameworks are often constrained by static planning, isolated execution, high computational overhead, and limited adaptability to complex tasks. This paper introduces CollabCoder, a novel Plan-Code Co-Evolution framework that improves code generation through dynamic multi-agent collaboration. The core idea is to design a collaborative decision-making process between the plan module and the code module to decide which module should be executed for the debugging process. Extensive experiments on widely used benchmarks demonstrate that CollabCoder consistently improves code quality and robustness across tasks. Importantly, CollabCoder achieves performance comparable to or exceeding current state-of-the-art methods while reducing computational overhead, with efficiency gains becoming more pronounced as benchmark difficulty increases. On the more challenging LiveCodeBench and xCodeEval benchmarks, our approach improves performance by 11-20% over strong baselines while reducing the number of API calls by an average of 4-10 per execution.