Debugging Like a Human: LDB, an LLM Debugger That Verifies Runtime Execution Step by Step

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step

Feb 25, 2024 • Li Zhong, Zilong Wang, Jingbo Shang

TL;DR Highlight

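A framework that automatically finds and fixes bugs in LLM-generated code by tracking intermediate variable values block by block, much like stepping through breakpoints.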

Who Should Read

Developers building LLM-based code generation or automated debugging pipelines, and AI coding tool developers who generate code with GPT-4 or CodeLlama and want to improve its quality.

Core Mechanics

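Earlier methods treat the program as a single unit and use only the test failure message as feedback, whereas LDB tracks intermediate variable values block by block during actual execution to pinpoint where the bug is.
It splits the program into basic blocks based on the CFG (Control Flow Graph, a graph of the code's execution paths) and shows the LLM the variable state before and after each block so the LLM can verify whether the block is correct.
When loops or recursion make the execution trace long, Selective Debugging samples only the first 5 and last 5 blocks to stay within the token limit.
Batch Debugging queries the verdicts for all blocks in a single batch, cutting the token cost of iterative debugging from O(B²×N) to O(B×N).
It also applies to high-quality code generated by GPT-4 or Reflexion, catching bugs that stronger generators missed (Reflexion + LDB = 95.1% on HumanEval).
About 80% of the bugs were semantic errors (syntactically valid code that behaves differently from what was intended), and LDB is particularly good at catching them thanks to execution information.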
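
The token saving from Batch Debugging can be illustrated with rough arithmetic. The sketch below assumes that B is the number of blocks in the trace and N the average number of tokens per block (the summary does not define these symbols, so this reading is an assumption), and that a per-block strategy resends the whole trace with every query. The function names are hypothetical and are not taken from the LDB code.

```python
def per_block_query_tokens(num_blocks: int, tokens_per_block: int) -> int:
    """Tokens sent if every block gets its own query that carries the full trace."""
    trace_tokens = num_blocks * tokens_per_block
    return num_blocks * trace_tokens  # grows as B^2 * N


def batch_query_tokens(num_blocks: int, tokens_per_block: int) -> int:
    """Tokens sent if a single query asks for a verdict on every block at once."""
    return num_blocks * tokens_per_block  # grows as B * N


if __name__ == "__main__":
    # Assumed figure for illustration only: 50 tokens per block.
    for blocks in (5, 10, 20):
        print(blocks, per_block_query_tokens(blocks, 50), batch_query_tokens(blocks, 50))
```

Under these assumptions the gap widens linearly with the number of blocks: with 10 blocks the per-block strategy sends ten times as many tokens as the batch query.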

Evidence

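With GPT-3.5: +9.1% on HumanEval, +5.4% on TransCoder, and +8.4% on MBPP (consistently outperforming the prior Self-Debugging method).
With CodeLlama-34B: +9.8% on TransCoder, the largest gain across all experiments.
Reflexion-generated code combined with LDB (GPT-3.5) reaches 95.1% on HumanEval; with a GPT-4o backbone it reaches 98.2%.
Bug localization accuracy: 93.7% on HumanEval, 95.3% on MBPP, and 86.7% on TransCoder (verified automatically by GPT-4).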

How to Apply

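After the LLM generates code and a test fails, collect snapshots of the variables before and after each basic block using Python's sys.settrace() or AST analysis, put them in a prompt, and implement a debugging loop that gets a per-block correct/incorrect verdict back from the LLM as JSON.
If you already generate code with GPT-4 or Claude, adding a pipeline stage that post-processes the generated code with an LDB-style GPT-3.5 debugger can raise quality further while keeping the debugging cost low.
For code with long loops, build the debugging prompt from only the first N and last N blocks instead of the full execution trace, so it does not exceed the context length.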
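
The "first N plus last N" sampling described above could look roughly like the following. This is a minimal sketch, not the LDB implementation: `sample_trace` is a hypothetical helper, and it keeps each block's original position so the prompt can still label blocks by their index in the full trace.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_trace(blocks: Sequence[T], keep: int = 5) -> List[Tuple[int, T]]:
    """Return (index, block) pairs for the first and last `keep` blocks."""
    if keep <= 0:
        raise ValueError("keep must be positive")
    indexed = list(enumerate(blocks))
    if len(indexed) <= 2 * keep:
        # Short traces are passed through unchanged.
        return indexed
    return indexed[:keep] + indexed[-keep:]


if __name__ == "__main__":
    trace = [f"block body {i}" for i in range(100)]
    for index, body in sample_trace(trace):
        print(f"[BLOCK-{index}] {body}")
```

Keeping the original indices is a design choice of this sketch: it lets the model see that blocks were skipped between the head and tail of the trace.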

Code Example


# LDB-style debugging prompt example (chat mode)

system_prompt = "You are an expert programming assistant."

# Step 1: code generation
user_msg_1 = """
Complete the following task in Python. Please respond with code only.
def is_sorted(lst):
    '''
    Given a list of numbers, return whether or not they are
    sorted in ascending order. If list has more than 1 duplicate
    of the same number, return False.
    '''
"""

# Step 2: debugging request with the failing test case and the block-by-block execution trace
debugging_prompt = """
The code above fails the given unit test:
assert is_sorted([1, 2, 2, 3, 3, 4]) == True  # Real Execution Output: False

Here is the code execution trace block by block with intermediate variable values.
For EACH BLOCK, answer whether it is correct or not.
Return a JSON with keys: `block`, `correct`, `explanation`.

[BLOCK-0]
# lst=[1, 2, 2, 3, 3, 4]
for i in range(len(lst) - 1):
    if lst[i] > lst[i + 1]:
# i=0, lst=[1, 2, 2, 3, 3, 4]

[BLOCK-5]
# i=4, lst=[1, 2, 2, 3, 3, 4]
return not any(lst.count(x) > 1 for x in lst)
# i=4, lst=[1, 2, 2, 3, 3, 4], _ret=False
"""

# Step 3: regenerate the code based on the debugging result
regeneration_prompt = "Please fix the Python code based on the debugging feedback above."

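# A simple example of capturing variable state in Python
# (line-level snapshots that can later be grouped into basic blocks)
import sys

def trace_blocks(func, test_input):
    """Capture the local variables each time a line of `func` is about to run."""
    states = []
    target_code = func.__code__

    def tracer(frame, event, arg):
        # Skip frames of other functions so the trace only covers `func`.
        if frame.f_code is not target_code:
            return None
        if event == 'line':
            states.append({
                'line': frame.f_lineno,
                # Store reprs so later mutations cannot alter earlier snapshots.
                'locals': {name: repr(value) for name, value in frame.f_locals.items()},
            })
        return tracer

    sys.settrace(tracer)
    try:
        result = func(test_input)
    finally:
        sys.settrace(None)

    return states, result

# Usage:
# states, result = trace_blocks(is_sorted, [1, 2, 2, 3, 3, 4])
# -> group `states` into blocks and insert them into the LLM prompt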
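
The last step above, grouping line-level states into blocks and reading back the model's verdicts, might be sketched as follows. Everything here is an assumption for illustration: `block_of_line` (a map from source line to block id) would have to come from a CFG analysis that is not shown, the reply is assumed to be a JSON array of objects with the keys `block`, `correct`, and `explanation`, and none of the helper names come from the LDB repository.

```python
import json
from itertools import groupby


def group_states(states, block_of_line):
    """Merge consecutive line snapshots that fall in the same block."""
    runs = []
    for block_id, run in groupby(states, key=lambda s: block_of_line[s["line"]]):
        run = list(run)
        runs.append({
            "block": block_id,
            "lines": [s["line"] for s in run],
            # A 'line' event fires before the line runs, so this is the
            # state on entry to the block.
            "before": run[0]["locals"],
        })
    # The state on entry to the next run doubles as the state after this one.
    # The final run gets no "after" because return events are not recorded.
    for current, following in zip(runs, runs[1:]):
        current["after"] = following["before"]
    return runs


def format_trace(runs):
    """Render runs in a bracketed, block-by-block layout for the prompt."""
    parts = []
    for position, run in enumerate(runs):
        parts.append(f"[BLOCK-{position}]")
        parts.append(f"# before: {run['before']}")
        parts.append(f"# source lines: {run['lines']}")
        if "after" in run:
            parts.append(f"# after: {run['after']}")
    return "\n".join(parts)


def first_incorrect_block(reply_text):
    """Return the first verdict marked incorrect, or None if all are correct."""
    for verdict in json.loads(reply_text):
        if not verdict["correct"]:
            return verdict
    return None


if __name__ == "__main__":
    states = [
        {"line": 2, "locals": {"lst": [1, 2, 2]}},
        {"line": 3, "locals": {"lst": [1, 2, 2], "i": 0}},
        {"line": 2, "locals": {"lst": [1, 2, 2], "i": 0}},
        {"line": 5, "locals": {"lst": [1, 2, 2], "i": 1}},
    ]
    print(format_trace(group_states(states, {2: 0, 3: 1, 5: 2})))
```

Blocks are numbered by their position in the execution order rather than by static block id, so a loop body that runs twice shows up as two separate entries in the trace.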

Terminology

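Basic Block: A contiguous run of code with a single entry and a single exit; a straight-line stretch of code up to the point where an if statement or loop branches.

CFG (Control Flow Graph): A graph of all execution paths through the code. Each node is a basic block and each edge is a branch direction, so it shows at a glance in what order the program can run.

Execution Trace: The ordered record of the basic blocks the program actually passed through when it ran. The same code produces different traces for different inputs.

Pass@1: The fraction of problems whose tests pass when the LLM generates code once. Higher means a better chance of getting it right on the first try.

Self-Debugging: A debugging approach in which the LLM explains its own code or dry-runs it (executes it mentally). Because it relies only on LLM reasoning without actual execution, it struggles with complex code.

Reflexion: A reinforcement-learning-style framework in which the LLM records its failures in natural language and applies them to the next attempt. It performs well at code generation and is used as a point of comparison for LDB.

Semantic Error: A bug where the code is syntactically valid and runs but does not behave as intended. Example: writing `>2` where `>1` was needed.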
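
The Basic Block and CFG definitions above can be made concrete with the classic "leader" rule: a block starts at the first instruction, at any jump target, and right after any jump. The sketch below applies that rule to a toy instruction list invented for illustration; it does not analyze real Python code and is not how LDB builds its CFG.

```python
from typing import List, Optional, Tuple

# Each toy instruction is (text, jump_target). jump_target is the index of the
# instruction it may jump to, or None for straight-line instructions.
Instruction = Tuple[str, Optional[int]]


def split_basic_blocks(program: List[Instruction]) -> List[List[str]]:
    """Split a toy instruction list into basic blocks using block leaders."""
    if not program:
        return []
    leaders = {0}  # the first instruction always starts a block
    for index, (_, target) in enumerate(program):
        if target is not None:
            leaders.add(target)         # a jump target starts a block
            if index + 1 < len(program):
                leaders.add(index + 1)  # so does the instruction after a jump
    ordered = sorted(leaders)
    blocks = []
    for start, end in zip(ordered, ordered[1:] + [len(program)]):
        blocks.append([text for text, _ in program[start:end]])
    return blocks


if __name__ == "__main__":
    program = [
        ("i = 0", None),           # 0
        ("if i >= n: goto 5", 5),  # 1
        ("total += xs[i]", None),  # 2
        ("i += 1", None),          # 3
        ("goto 1", 1),             # 4
        ("return total", None),    # 5
    ]
    for number, block in enumerate(split_basic_blocks(program)):
        print(number, block)
```

For this toy loop the rule yields four blocks: the initialization, the loop test, the loop body, and the return.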

Related Resources

https://github.com/FloridSleeves/LLMDebugger

Original Abstract

Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.