CoSPlay: Test-Time Cooperative Self-Play with Self-Generated Code and Unit Tests

TL;DR Highlight

An inference-time optimization framework in which code and unit tests evaluate each other and improve together, with no ground truth required.

Who Should Read

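ML engineers and researchers who want to raise the accuracy of LLM-based code generation pipelines, especially developers who want to verify code quality at inference time without ground-truth test cases.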

Core Mechanics

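Proposes 'Cooperative Self-Play', in which code candidates and unit tests (UTs) evaluate each other and improve together without ground truth (GT, i.e. reference answers). It requires no training and runs only at inference time.
Stage 1 (explore-and-attack idea generation): the LLM explores diverse solution strategies, and UT ideas are derived from each strategy's potential failure modes, so the test pool is discriminative from the start.
Stage 2 (self-play driven by the execution matrix): using pass counts from the code × UT execution matrix as the signal, it repeatedly ① removes codes that fail every UT, ② breaks spurious coupling (wrong code and a wrong UT that happen to agree), ③ repairs codes using high-confidence UTs, and ④ replaces UTs that have lost their discriminative power.
Stage 3 (output-consensus clustering): codes tied at the highest pass count are run on random inputs and grouped by identical output signature, and the final code is chosen from the largest cluster. This relies on the principle that correct codes produce the same outputs while wrong codes diverge.
A higher UT pass count means the UT is more likely to be correct, and a higher code pass count means the code is more likely to be correct. This correlation is verified both theoretically and experimentally, so pass counts can serve as a quality signal without GT.
During self-play, the pass-count distributions of both codes and UTs shift upward round by round. With the 7B model, the share of low-quality UTs falls by about 29.9% and the share of low-quality codes by about 25.2%.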
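The pass-count signal used in Stage 2 can be illustrated with a toy execution matrix. The following is a minimal sketch, not the paper's implementation: candidate codes are plain Python callables rather than generated programs, and `pass_counts`, the toy problem, and the example data are all hypothetical names introduced here for illustration.

```python
from typing import Callable, List, Tuple

UnitTest = Tuple[int, int]  # (input, expected_output)


def pass_counts(codes: List[Callable[[int], int]], uts: List[UnitTest]):
    """Build the code x UT execution matrix and its row/column pass counts."""
    matrix = []
    for code in codes:
        row = []
        for inp, expected in uts:
            try:
                row.append(code(inp) == expected)
            except Exception:
                row.append(False)  # a crash counts as a failed test
        matrix.append(row)
    code_pass = [sum(row) for row in matrix]      # UTs passed by each code
    ut_pass = [sum(col) for col in zip(*matrix)]  # codes passing each UT
    return matrix, code_pass, ut_pass


# Toy problem (assumed for illustration): return the square of x.
codes = [
    lambda x: x * x,   # correct
    lambda x: x ** 2,  # correct
    lambda x: x + x,   # wrong, but agrees with the correct codes on x = 0 and x = 2
    lambda x: 1 // 0,  # always crashes
]
uts = [(0, 0), (2, 4), (3, 9), (3, 6)]  # the last UT has a wrong expected output

matrix, code_pass, ut_pass = pass_counts(codes, uts)
print(code_pass)  # [3, 3, 3, 0]
print(ut_pass)    # [3, 3, 2, 1]
```

In this toy case the wrong doubling code ties with the correct codes at 3 passes only because it agrees with the wrong UT `(3, 6)`, and that UT also has the lowest non-zero pass count. This is the kind of spurious coupling that Stage 2 is meant to break, and the crashing code with 0 passes is the kind of candidate it removes first.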

Evidence

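Applied to Qwen2.5-7B-Instruct, average BoN (Best-of-N accuracy) rises from 22.1% to 33.2% (+11.1 pp), and UT accuracy rises from 14.6% to 78.3% (+63.7 pp).
Without any training, it reaches 33.2% and so matches CURE-7B (BoN 32.9%), an RLVR model trained on 4.5k GT examples. Applying CoSPlay on top of CURE-7B raises BoN further, from 32.9% to 38.6% (+5.7 pp).
Reaches 37.2% pass@1 with a 682K-token budget, outperforming existing TTS methods that use a similar or larger budget (BoN with N=256: 22.5% at 745K tokens; ThinkCoder Round20: 27.8% at 1.16M tokens).
Also effective on large models such as DeepSeek-V3.2-685B: average BoN goes from 65.7% to 68.2%, and on CodeForces, the hardest benchmark, from 39.3% to 50.0% (+10.7 pp).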

How to Apply

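If your code generation pipeline has no GT tests, first have the LLM generate several solution strategies, then extract the expected failure cases of each strategy and use them as unit test inputs. Initial UT quality is far higher than with direct UT generation (UT accuracy 12.5% → 37.2%).
When several code candidates have been generated, build the pass-count matrix and add an iterative loop that filters out codes that fail every UT and UTs that no code passes. Quality rises with each round, and a meaningful improvement is possible within at most 5 rounds.
When Best-of-N leaves several tied codes, run each of them on R random valid inputs, cluster the codes whose output vectors match, and pick the final code from the largest cluster. Cluster size itself is a reliable proxy for correctness.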
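The tie-breaking step in the last item can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: codes are Python callables, the inputs are fixed rather than random so the result is reproducible, signatures must match exactly, and `largest_consensus_cluster` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def largest_consensus_cluster(
    codes: Sequence[Callable[[int], int]],
    inputs: Sequence[int],
) -> List[int]:
    """Return the indices of the codes in the largest output-signature cluster."""
    clusters: Dict[Tuple, List[int]] = defaultdict(list)
    for idx, code in enumerate(codes):
        signature = []
        for x in inputs:
            try:
                signature.append(code(x))
            except Exception:
                signature.append("ERR")  # record a crash as part of the signature
        clusters[tuple(signature)].append(idx)
    # Ties between equally large clusters resolve to the one seen first.
    return max(clusters.values(), key=len)


# Toy problem (assumed for illustration): return the absolute value of x.
tied_codes = [
    lambda x: abs(x),               # correct
    lambda x: x if x >= 0 else -x,  # correct
    lambda x: x,                    # wrong for negative inputs
    lambda x: -x,                   # wrong for positive inputs
]
winners = largest_consensus_cluster(tied_codes, inputs=[-2, 0, 3])
print(winners)  # [0, 1]
```

The two correct codes share the signature `(2, 0, 3)`, while each wrong code ends up alone in its own cluster, which is the intuition behind using cluster size as the selection criterion.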

Code Example

# Sketch of CoSPlay's core logic (Python pseudo-code)

from collections import Counter

def cosplay(problem: str, llm, Nc=16, Nt=16, Tmax=5, R=16):
    # Stage 1: explore solution ideas, then derive attack ideas from them.
    # Assumes llm.generate returns a single string, so hints are parsed one per line.
    raw_hints = llm.generate(f"Generate high-level solution hints, one per line, for: {problem}")
    hints = [h.strip() for h in raw_hints.splitlines() if h.strip()]
    plans = [llm.generate(f"Expand plan for hint: {h}") for h in hints]
    attack_ideas = [llm.generate(f"What edge cases / failure modes for plan: {p}?") for p in plans]

    # Initialize the code and UT pools
    codes = [llm.generate_code(problem, plan=p) for p in plans[:Nc]]
    uts = []
    for i in range(Nt):
        if i < Nt // 2:
            # First half: random valid inputs
            inp = llm.generate(f"Generate random valid input for: {problem}")
        else:
            # Second half: inputs that target a specific failure mode
            inp = llm.generate(f"Generate input targeting failure: {attack_ideas[i % len(attack_ideas)]}")
        # Self-consistency filter: accept the UT when at least 3 of 4 samples agree
        outputs = [llm.solve(problem, inp) for _ in range(4)]
        answer, votes = Counter(outputs).most_common(1)[0]
        if votes >= 3:
            uts.append((inp, answer))

    # Stage 2: self-play loop
    for _ in range(Tmax):
        # M[i][j] is True when code i passes UT j
        M = [[execute(c, ut[0]) == ut[1] for ut in uts] for c in codes]
        code_pass = [sum(row) for row in M]
        ut_pass = [sum(M[i][j] for i in range(len(codes))) for j in range(len(uts))]

        # Step 1: drop codes with zero passes and resample them
        codes = [llm.generate_code(problem) if code_pass[i] == 0 else codes[i]
                 for i in range(len(codes))]

        # Step 2: replace the lowest-passing non-trivial UT (removes spurious coupling)
        # Step 3: repair failing codes with the most trusted non-trivial UT
        # Step 4: replace UTs with a 0% or 100% pass rate
        # ... (recompute M after each step)

    # Stage 3: cluster selection
    # Recompute pass counts first, since the pools changed during self-play.
    code_pass = [sum(execute(c, inp) == exp for inp, exp in uts) for c in codes]
    best = max(code_pass)
    top_codes = [c for c, n in zip(codes, code_pass) if n == best]
    random_inputs = [llm.generate(f"Random valid input: {problem}") for _ in range(R)]

    # Greedy clustering by output signature. Assumes execute returns 'ERR' when a
    # run fails; 'ERR' is treated as a wildcard that matches any output.
    # A list of (signature, members) keeps identical code strings as separate votes.
    clusters = []
    for c in top_codes:
        sig = tuple(execute(c, z) for z in random_inputs)
        for key, members in clusters:
            if all(a == b or a == 'ERR' or b == 'ERR' for a, b in zip(sig, key)):
                members.append(c)
                break
        else:
            clusters.append((sig, [c]))

    best_cluster = max((members for _, members in clusters), key=len)
    return best_cluster[0]
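The sketch above calls `execute(code, input)` without defining it. One possible shape for that helper is shown below, assuming each candidate is a standalone Python program that reads from stdin and writes to stdout. The signature, the `timeout_s` parameter, and the `'ERR'` sentinel are assumptions chosen to match the sketch, not details taken from the paper.

```python
import subprocess
import sys


def execute(code: str, stdin_text: str, timeout_s: float = 2.0) -> str:
    """Run a candidate program in a fresh interpreter and return its stdout.

    Returns the sentinel 'ERR' on a non-zero exit code or a timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "ERR"
    if proc.returncode != 0:
        return "ERR"
    return proc.stdout.strip()


print(execute("print(int(input()) * 2)", "21\n"))  # 42
print(execute("raise ValueError('boom')", ""))     # ERR
```

A subprocess with a timeout keeps a crashing or looping candidate from taking down the main loop, but it is not a sandbox. LLM-generated code should additionally be isolated (for example in a container) before being run this way.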

Terminology

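Ground Truth (GT): Reference answer data, e.g. the input/output test cases written by the official problem setter. Without it, verifying whether code is correct is hard.

Unit Test (UT): A test case that checks whether a given input produces the expected output. In code generation, it is an (input, expected_output) pair stating that "this input must yield this answer".

RLVR (Reinforcement Learning with Verifiable Rewards): Reinforcement learning driven by verifiable rewards, e.g. granting a reward when the code's execution result is correct. Training requires GT data and is costly.

TTS (Test-Time Scaling): A strategy that spends more compute at inference time to improve performance. The trained model stays fixed while answers are generated or verified multiple times.

BoN (Best-of-N): Generate N candidates and pick the best one. Increasing N raises the chance that a correct answer is among them, but the selection criterion becomes the hard part.

Pass Count: The number of UTs a code candidate passes, or the number of codes that pass a given UT. It is the core signal for quality estimation in CoSPlay.

Spurious Coupling: Wrong code and a wrong UT passing each other by coincidence, e.g. both share the same flawed logic, so the incorrect pair is judged "correct".

Self-Consistency: Asking the LLM the same question several times and trusting the answer when the samples agree. It filters out noise by majority vote.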
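The self-consistency filter can be written as a small majority-vote helper. This is a minimal sketch: the 3-of-4 threshold mirrors the code example earlier in this summary, while `majority_answer` and its `min_votes` parameter are hypothetical names introduced here.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_answer(samples: Sequence[Hashable], min_votes: int = 3) -> Optional[Hashable]:
    """Return the most common sample if it has at least min_votes, else None."""
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    return answer if votes >= min_votes else None


print(majority_answer(["4", "4", "5", "4"]))  # 4
print(majority_answer(["4", "5", "6", "4"]))  # None
```

Returning `None` when agreement is too weak lets the caller discard the candidate UT instead of keeping a noisy expected output.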

Original Abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through executable verification. Yet Ground-Truth Unit Tests (GT UTs) remain a bottleneck: SOTA RLVR methods require them for costly training, while existing TTS methods lose competitiveness without them. This motivates GT-free TTS, where existing methods directly use self-generated UTs to refine and select code candidates. Yet such UTs are often noisy or spuriously coupled with wrong code, and UT quality in turn cannot be validated without reliable code. The key challenge is therefore to jointly improve both. To this end, we present CoSPlay, a GT-free, training-free framework that jointly improves codes and UTs through cooperative self-play. It first explores diverse solution ideas and identifies their potential failure modes to produce discriminative UT ideas. It then uses bidirectional pass-count signals from the Code-UT execution matrix to iteratively prune or fix weak codes and refresh or replace unreliable UTs, letting the two pools co-evolve. Finally, when multiple codes remain tied at the highest pass count, it picks the final code from the largest output-consensus cluster, since correct codes agree on the same inputs while wrong codes diverge. Experiments on four challenging benchmarks show that CoSPlay on Qwen2.5-7B-Instruct improves average BoN from 22.1% to 33.2% and UT accuracy from 14.6% to 78.3%, matching or surpassing the RLVR model CURE-7B. When applied to CURE-7B, it further improves BoN by 5.7%. CoSPlay also generalizes across diverse backbones and outperforms GT-free TTS baselines under comparable token budgets, with continued gains as the budget scales up. These results suggest a scalable inference strategy for competitive code generation without any GT data.