MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Jan 9, 2026 • Jiefu Ou, Sapana Chaudhary, Kaj Bostrom +4

TL;DR Highlight

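When an LLM automatically optimizes CUDA kernels or C++ code, MaxCode turns execution feedback into natural-language critiques and uses RL to guide the search, gaining up to 27% more speedup than existing methods.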

Who Should Read

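ML infrastructure and systems developers who want to automate CUDA kernel optimization for PyTorch models or improve high-performance C++ code. AI engineers working out how to attach an execution-feedback loop to an LLM-based code generation agent.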

Core Mechanics

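Recasts code optimization as max-reward RL (RL that targets the single highest reward rather than the cumulative reward), so the best-performing code found so far is always kept in context.
Instead of passing back raw execution results (numbers only), a separate critique model (Claude-3.7-Sonnet) generates natural-language diagnoses such as 'memory bandwidth bottleneck' or 'wrong operation order' and hands them to the LLM.
A reward-to-go model (fine-tuned Qwen2.5-7B-Instruct) predicts the expected future maximum speedup of a search path, so promising candidates can be picked first without actually executing them.
Existing search methods (CUDA-LLM, Effi-Learner) plug into the framework and improve immediately, with no retraining required.
The same computation written with two completely different structures, a chain of sequential sub-kernels versus a single fully fused kernel, can run at nearly identical speed, so plain faster/slower feedback gives little direction for optimization; the critique addresses this.
As the inference budget (search depth) grows, MaxCode's performance rises faster than that of existing methods, showing a scaling effect.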
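The difference between the cumulative and max-reward objectives, and the kind of target a reward-to-go model would be trained to predict, can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, the rewards stand for measured speedups, and discounting is left out for simplicity.

```python
from typing import List

def cumulative_return(rewards: List[float]) -> float:
    """Standard RL objective: the sum of rewards along the trajectory."""
    return sum(rewards)

def max_reward_return(rewards: List[float]) -> float:
    """Max-reward objective: only the single best step counts."""
    return max(rewards)

def max_reward_to_go(rewards: List[float]) -> List[float]:
    """For each step t, the best reward reachable from t onward (suffix max)."""
    targets = []
    best = float("-inf")
    for r in reversed(rewards):
        best = max(best, r)
        targets.append(best)
    return targets[::-1]

# Hypothetical speedups over five refinement steps (0.0 = failed attempt).
speedups = [1.0, 1.8, 0.0, 2.4, 2.1]
print(max_reward_return(speedups))  # 2.4
print(max_reward_to_go(speedups))   # [2.4, 2.4, 2.4, 2.4, 2.1]
```

Under the max-reward view, the failed third attempt costs nothing as long as a later step reaches 2.4x, which is why keeping the best solution so far in context is a natural fit.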

Evidence

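CUDA-LLM + MaxCode raises speedup on KernelBench Level 1 from 2.49x to 3.17x (27.3% relative improvement) and on Level 2 from 1.45x to 1.61x (11.0% improvement).
On the PIE (C++ optimization) benchmark, CUDA-LLM + MaxCode goes from 1.42x to 1.74x (22.5% relative improvement), and average rank improves from 2.05 to 1.74.
Full MaxCode (Traj Critique Best Perf) consistently reaches a higher max speedup than either component alone (Critique only, Best Perf only); the combination is what matters.
The reward-to-go model improves rank on KernelBench L2 and PIE (1.57→1.33 and 1.55→1.43), but degrades performance on L1 because of distribution mismatch.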
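The relative-improvement percentages follow directly from the reported speedup pairs. A short check, where the helper name `relative_improvement` is only illustrative:

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0

# Speedup pairs (baseline, with MaxCode) as reported above.
for name, base, new in [
    ("KernelBench L1", 2.49, 3.17),
    ("KernelBench L2", 1.45, 1.61),
    ("PIE", 1.42, 1.74),
]:
    print(f"{name}: {relative_improvement(base, new):.1f}%")
# KernelBench L1: 27.3%
# KernelBench L2: 11.0%
# PIE: 22.5%
```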

How to Apply

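Add a critique step to an existing LLM code-optimization loop: rather than pasting execution results straight into the prompt, first send them to a separate LLM (Claude-3.7-Sonnet with extended thinking enabled) with a request like 'diagnose this code's bottleneck and suggest how to improve it', then put that natural-language critique into the main prompt.
In an iterative optimization loop, always include the fastest version so far, with its code and feedback, in the prompt: the LLM can then see how the current attempt compares with the best record, which reduces pointless regressions.
When there are too many search candidates and GPU execution cost becomes a burden, fine-tune a small model of the Qwen2.5-7B class with LoRA as a reward predictor, and use it only to pre-filter low-promise candidates before execution.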
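The pre-filtering idea in the last tip might look roughly like the sketch below. It assumes two caller-supplied callables that do not come from the paper: `predict_reward` (a learned predictor that scores a candidate without running it) and `run_benchmark` (compiles and runs a candidate, then returns its measured speedup). The toy dictionaries only stand in for those components.

```python
from typing import Callable, List, Tuple

def prefilter_and_run(
    candidates: List[str],
    predict_reward: Callable[[str], float],  # hypothetical reward predictor
    run_benchmark: Callable[[str], float],   # hypothetical executor, returns speedup
    keep: int,
) -> Tuple[str, float]:
    """Execute only the `keep` candidates the predictor ranks highest."""
    ranked = sorted(candidates, key=predict_reward, reverse=True)
    shortlisted = ranked[:keep]
    measured = [(code, run_benchmark(code)) for code in shortlisted]
    return max(measured, key=lambda pair: pair[1])

# Toy stand-ins so the sketch runs end to end.
fake_scores = {"v1": 0.2, "v2": 0.9, "v3": 0.7, "v4": 0.1}
fake_speedups = {"v1": 1.1, "v2": 1.6, "v3": 2.0, "v4": 3.0}
executed = []

def run(code: str) -> float:
    executed.append(code)  # record which candidates actually hit the GPU
    return fake_speedups[code]

best = prefilter_and_run(list(fake_scores), fake_scores.get, run, keep=2)
print(best)      # ('v3', 2.0)
print(executed)  # ['v2', 'v3']
```

In this toy data the predictor ranks `v4` last even though it would have been the fastest, so it is never executed. That is the risk of a mismatched predictor, and a reason to treat the model as a cost-saving filter rather than as a replacement for real measurement.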

Code Example

# Core prompt structure of MaxCode (Traj Critique Best Perf variant)

GENERATOR_PROMPT = """
You write custom CUDA kernels to replace the pytorch operators in the given architecture to get speedups.

You are provided with:
1. The pytorch architecture to optimize
2. Your BEST-PERFORMING optimization so far and its execution feedback
3. Your TRAJECTORY of previous attempts with execution feedback
4. NATURAL LANGUAGE CRITIQUES for each attempt

Given this information, refine your optimization:
- If compiled=False: fix compilation errors (refer to best-performing solution for cues)
- If correctness=False: fix logic errors
- If correct: reduce runtime below the best-performing solution so far

IMPLEMENT CUDA OPERATORS using:
from torch.utils.cpp_extension import load_inline
"""

CRITIQUE_PROMPT = """
Given the optimization attempt and execution feedback:
1. Diagnose: What are the performance bottlenecks? (memory bandwidth? compute utilization? algorithmic inefficiency?)
2. Suggest: Provide actionable steps to fix or improve performance
   - If compile error: explain why it fails and how to fix
   - If correct but slow: identify specific bottleneck and optimization strategy
"""

# Usage example (pseudo-code): `llm` and `execute_and_measure` are caller-supplied
# placeholders, and each result is assumed to expose a `speedup` attribute.
def maxcode_loop(initial_code, llm, execute_and_measure, max_depth=8, k_candidates=8):
    best_code, best_speedup = initial_code, 1.0
    trajectory = []

    for depth in range(max_depth):
        # 1. Generate k candidates
        candidates = [llm.generate(GENERATOR_PROMPT,
                                   arch=initial_code,
                                   best=(best_code, best_speedup),
                                   trajectory=trajectory)
                      for _ in range(k_candidates)]

        # 2. Execute each candidate and measure its speed
        results = [execute_and_measure(c) for c in candidates]

        # 3. Generate a critique for each attempt
        critiques = [llm.generate(CRITIQUE_PROMPT, code=c, feedback=r)
                     for c, r in zip(candidates, results)]

        # 4. Select the fastest; a failed run (speedup missing or None) scores 0
        speedups = [getattr(r, "speedup", None) or 0.0 for r in results]
        best_idx = max(range(len(candidates)), key=lambda i: speedups[i])
        if speedups[best_idx] > best_speedup:
            best_code = candidates[best_idx]
            best_speedup = speedups[best_idx]

        # Extend the trajectory with this round's best attempt and its critique
        trajectory.append((candidates[best_idx], results[best_idx], critiques[best_idx]))

    return best_code, best_speedup

Terminology

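max-reward RL: Where standard RL maximizes the sum of rewards, max-reward RL targets the single highest reward ever achieved. Code optimization succeeds if even one of many attempts finds the fastest code, so this objective fits.

CUDA kernel: Compute code that runs directly on the GPU. PyTorch uses general-purpose CUDA kernels internally; hand-writing one for a specific operation can make it several times faster. Writing CUDA kernels requires an understanding of GPU architecture, so it is hard.

Reward-to-go: A model that predicts in advance the maximum reward obtainable from the current state onward. It picks out, before execution, the candidates likely to yield better results in the long run.

critique model: An LLM that takes code execution results and explains in human-readable language why the code is slow and what to fix. It steers optimization far better than numeric feedback alone.

KernelBench: A benchmark that evaluates an LLM's ability to optimize CUDA kernels. It provides 250 neural-network workloads in PyTorch and asks for faster CUDA replacements, across four difficulty levels from single operators to full HuggingFace architectures.

PIE: An optimization benchmark built from 77K pairs of competitive-programming-level C++ code (slow version / fast version). It evaluates C++ algorithmic optimization ability.

LoRA: A fine-tuning technique that leaves the model's full parameters unchanged and trains only two small added matrices. Training cost drops to one-tenth or less of full fine-tuning, so it is used to adapt small models to a specific task quickly.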
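To make the LoRA entry concrete, the sketch below counts trainable parameters for one weight matrix under full fine-tuning versus a low-rank update of the form W + B·A. The layer size (4096) and rank (16) are illustrative assumptions, not values from the paper, and the count covers trainable parameters only, not total training cost.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for the two LoRA factors.

    A has shape (rank, d_in) and B has shape (d_out, rank);
    the frozen weight W is adapted as W + B @ A.
    """
    return rank * d_in + d_out * rank

d = 4096  # assumed hidden size, for illustration only
print(full_params(d, d))            # 16777216
print(lora_params(d, d, rank=16))   # 131072
```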

Original Abstract

Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages and (ii) requires interpretation of performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.